International Bureau



INTERNATIONAL APPLICATION PUBLISHED UNDER THE PATENT COOPERATION TREATY (PCT) 



(51) International Patent Classification 6 : 



A2 



(11) International Publication Number: WO 98/12881 

(43) International Publication Date: 26 March 1998 (26.03.98) 



(21) International Application Number: PCT/US97/17132

(22) International Filing Date: 22 September 1997 (22.09.97) 



(30) Priority Data: 

60/025,304 



20 September 1996 (20.09.96) US 



(71) Applicant: NETBOT, INC. [US/US]; 4530 Union Bay, N.E.,
Seattle, WA 98105 (US).

(72) Inventors: CHRISTIANSON, David; 5035 15th Avenue N.E.
#102, Seattle, WA 98105 (US). DOORENBOS, Robert,
B.; 1154 N.W. 59th Street #A32, Seattle, WA 98107 (US).
ETZIONI, Oren; 5820 57th Avenue N.E., Seattle, WA
98105 (US). KWOK, Chung; 4801 24th Avenue N.E. #535,
Seattle, WA 98105 (US). LAUCKHART, Gregory; 619
N.E. 75th Street, Seattle, WA 98105 (US). SELBERG,
Erik; 3516 N.E. 75th Street #10, Seattle, WA 98115 (US).
WELD, Daniel, S.; 4315 N.E. 43rd Street, Seattle, WA
98105 (US).

(74) Agents: MORRIS, Francis, E. et al.; Pennie & Edmonds LLP,
1155 Avenue of the Americas, New York, NY 10036 (US).



(81) Designated States: AL, AM, AU, AZ, BA, BB, BG, BR, BY,
CA, CN, CU, CZ, EE, GE, GH, HU, ID, IL, IS, JP, KG,
KP, KR, KZ, LC, LK, LR, LT, LV, MD, MG, MK, MN,
MX, NO, NZ, PL, RO, RU, SG, SI, SK, SL, TJ, TM, TR,
TT, UA, UZ, VN, YU, ARIPO patent (GH, KE, LS, MW,
SD, SZ, UG, ZW), Eurasian patent (AM, AZ, BY, KG, KZ,
MD, RU, TJ, TM), European patent (AT, BE, CH, DE, DK,
ES, FI, FR, GB, GR, IE, IT, LU, MC, NL, PT, SE), OAPI
patent (BF, BJ, CF, CG, CI, CM, GA, GN, ML, MR, NE,
SN, TD, TG).



Published 

Without international search report and to be republished 
upon receipt of that report. 






(54) Title: METHOD AND SYSTEM FOR NETWORK INFORMATION ACCESS 




[Front-page drawing; legible label: INFORMATION SOURCE]



(57) Abstract 

This invention provides assistance to a user in accessing network attached information sources. In one aspect, the invention is a 
method for intelligently routing a user query to information sources relevant to that query, extracting relevant data fields from received 
responses, and intelligently presenting the extracted data in order of estimated relevance. The system of this invention implements one
or more steps of the method in a centralized or distributed manner on one or more network attached computers. Further, this invention
provides a novel language and implementation that facilitates easily written and maintained descriptions of information source query and
response formats.

1. FIELD OF THE INVENTION
The field of this invention relates to information 
access over networks, and specifically to the automatic 
location and evaluation of relevant information available 
over public or private networks from information sources in 
response to user queries. 

2. BACKGROUND 

The exponential growth of private intranets and the
public Internet has produced a daunting labyrinth of
increasingly numerous documents, databases and utilities.
Almost any type of information is now available somewhere,
but most users cannot find what they seek, and even expert
users waste copious time and effort searching for appropriate
information sources. One problem is simply the increasingly
large number of available information sources, which is beyond
the comprehension of a single user. A second problem, along
with this growth in available information and information
sources, is a commensurate growth in software utilities and
methods to manage, access, and present this information.
Each utility has a different and often unique interface and
set of commands and capabilities, and is appropriate for a
different set of users and a different set of information
types and sources. Thus the sheer diversity of available
utilities creates a problem for users comparable to that
created by the information explosion. Users are now faced with
the twin problems of which tool to use to inquire at which
information source.

In the past, efforts have been made to provide users with
automatic, computer assisted services that can help solve
these twin problems of the network revolution. For example,
AI researchers have created several prototype software agents
that help users with e-mail and netnews filtering (Pattie
Maes et al., 1993, Learning interface agents, Proceedings of
AAAI-93), agents that assist with world wide web browsing (H.
Lieberman, 1995, Letizia: An agent that assists web browsing,
Proc. 15th Int. Joint Conf. on A.I., pp. 924-929; Robert
Armstrong et al., 1995, Webwatcher: A learning apprentice for
the world wide web, Working Notes of the AAAI Spring
Symposium: Information Gathering from Heterogeneous,
Distributed Environments, pp. 6-12, Stanford University, AAAI
Press), agents that schedule meetings (Lisa Dent et al.,
1992, A personal learning apprentice, Proc. 10th Nat. Conf.
on A.I., pp. 96-103; Pattie Maes, 1994, Agents that reduce
work and information overload, Comm. of the ACM 37(7):31-40,
146; Tom Mitchell et al., 1994, Experience with a learning
personal assistant, Comm. of the ACM 37(7):81-91), and agents
that perform internet-related tasks (O. Etzioni et al., 1994,
A softbot-based interface to the internet, CACM 37(7):72-75).
Increasingly, the information such agents need to access is
available on the World Wide Web. Unfortunately, even a
domain as standardized as the WWW has turned out to pose
significant problems for automatic [illegible]
format of information display, making no attempt to hint at
its meaning or semantic content. Currently, no accepted
"semantic markup language" for the web exists, nor is one
likely to be adopted universally. The Internet can be expected
[illegible] the Internet and the
World Wide Web have posed severe [illegible]
heretofore provided sufficient additional value to replace
the use of a web browser having access to existing
directories and indices such as Yahoo or Lycos. Second, such
services have not yet been able to understand and competently
parse relevant information from the responses returned from
a variety of Internet and Web information sources. Third,
existing services and agents have not been easy to adapt to
the ever-increasing numbers of sources with their ever-
changing response formats. This is due to the
individualized, hand-coded interface to each Internet service
and Web site utilized by existing agents (Yigal Arens et al.,
1993, Retrieving and integrating data from multiple
information sources, International Journal on Intelligent and
Cooperative Information Systems 2(2):127-158; O. Etzioni et
al., 1994, A softbot-based interface to the internet, CACM
37(7):72-75; B. Krulwich, 1995, Bargain finder agent
prototype, Technical report, Andersen Consulting; Alon Y.
Levy et al., 1995, Data model and query evaluation in global
information systems, Journal of Intelligent Information
Systems, Special Issue on Networked Information Discovery and
Retrieval 5(2); Mike Perkowitz et al., 1995, Category
translation: Learning to understand information on the
internet, Proc. 15th Int. Joint Conf. on A.I.). Preferably,
a service or agent should be able to access a new or changed
Internet information source in order to automatically learn
how to retrieve relevant information from the source. This
would be advantageous even if such a facility were limited to
groups of sources with response formats selected according to
certain constraining principles.

3. SUMMARY OF THE INVENTION 
It is a broad object of this invention to solve these
fundamental problems by a method and system that provide a
personalized network robot, called a "netbot." A netbot
acts as a user's intelligent assistant by tracking available
network information sources, knowing the relevant information
and features of each particular source, and upon user request
determining which sources are relevant to a given query,
forwarding the query to the most relevant information
sources, understanding the responses returned from each
source, and integrating and intelligently presenting the
query results to the user.

The netbots of this invention possess several
advantages, including the following. First, a netbot returns
only the most relevant information to the user. On the one
hand, each user query is forwarded only to the primary
information sources determined to be most relevant. On the
other hand, information source responses are parsed and
understood so that only the relevant data items are extracted
for user presentation. Duplicate, stale, and irrelevant
information items are discarded. Second, a netbot is fast.
Because it automatically searches the relevant sources
in parallel, it can present information as quickly as the
fastest primary source returns a response. Despite changing
conditions which cause different information sources to
fluctuate in speed, a netbot integrator remains as fast as
the fastest source. Sources that have no information to
return to a query do not slow the user since the netbot
simply ignores them. Third, netbots are easily adapted to
the ever-increasing number of network information sources
with ever-changing response formats. Netbots utilize a new
and novel declarative language for describing information
sources. A source description is short and easily
understandable, and therefore is easily written and
maintained.

Therefore, in one aspect the invention includes a method
for efficient access to information sources on a network
comprising preferably one or more of the following steps:
receiving a user query for information; determining the
information sources most relevant to this query; retrieving a
description of each information source; formatting the query
according to this description in a manner suitable for each
information source and transmitting the formatted query to
each source; receiving responses from the information sources;
for each source, understanding and extracting the relevant
data fields according to the retrieved description; and
presenting to the user the relevant data from each
information source in an intelligent manner ranked by an
estimate of its relevance. Advantageously, these steps are
performed in parallel to the greatest extent possible. In
particular, at least, all queries are transmitted to all
relevant information sources in parallel without waiting for
intervening responses.
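The method steps listed above can be sketched roughly as follows. This is only an illustrative outline, not the disclosed implementation: the SourceDescription class, the answer_query function, and the two simulated stores are hypothetical names introduced here, and network access is replaced by in-memory lookups so that the sketch runs on its own.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class SourceDescription:
    """Hypothetical stand-in for a retrieved description of one source."""
    name: str
    format_query: Callable[[str], str]    # user query -> source-specific request
    fetch: Callable[[str], str]           # request -> raw response (simulated)
    extract: Callable[[str], List[Dict]]  # raw response -> relevant data fields


def answer_query(query: str, descriptions: List[SourceDescription],
                 relevance: Callable[[Dict], float]) -> List[Dict]:
    """Query all given sources in parallel and rank the extracted items."""
    def ask(desc: SourceDescription) -> List[Dict]:
        request = desc.format_query(query)
        return desc.extract(desc.fetch(request))

    items: List[Dict] = []
    with ThreadPoolExecutor(max_workers=max(1, len(descriptions))) as pool:
        # All requests are submitted before any response is awaited.
        futures = [pool.submit(ask, desc) for desc in descriptions]
        for future in as_completed(futures):
            items.extend(future.result())
    return sorted(items, key=relevance, reverse=True)


def make_source(name: str, catalog: Dict[str, List[Dict]]) -> SourceDescription:
    """Build a simulated source backed by an in-memory catalog (assumption)."""
    return SourceDescription(
        name=name,
        format_query=lambda q: q.strip().lower(),
        fetch=lambda request: request,  # simulated: response is the lookup key
        extract=lambda resp: [dict(item, source=name)
                              for item in catalog.get(resp, [])],
    )


sources = [
    make_source("store-a", {"editor": [{"title": "Editor Pro", "score": 0.9}]}),
    make_source("store-b", {"editor": [{"title": "Tiny Editor", "score": 0.4}]}),
]
results = answer_query(" Editor ", sources, relevance=lambda item: item["score"])
```

A thread pool is used here only because it is a simple way to issue every request before waiting on any response; the text does not prescribe a particular concurrency mechanism.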

In another aspect the invention comprises a computer
system and apparatus for performing one or more steps of the
method of this invention. The user has a presentation device
attached to a network to which is also attached a plurality
of information sources. The presentation device receives
user queries and displays netbot responses. Further, the
presentation device performs one or more of the steps of the
method of this invention. One or more of those steps not
performed on this device can advantageously be performed on
network attached netbot server computers, which respond to
functional requests from the user device. Optionally, the
user device can range from a diskless hand-held terminal, to
a PC, to a workstation, and so forth.

In a further aspect the invention comprises a new
language and language implementation to facilitate the
creation and maintenance of descriptions of information
sources. Importantly, this language recognizes relevant data
fields in responses returned from information sources and is
capable of extracting all such fields. In the preferred
embodiment this language has an action statement component
and a regular expression component. The regular expression
component has novel features for creating modular
hierarchical descriptions of regular expressions, for binding
variables to the correct sub-strings recognized during
pattern match to a response of an information source, for
performing arbitrary action language statements with multiple
variable bindings, and for specifying backtrack free
recognition of sub-strings where possible.
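The idea of composing small patterns into larger ones and binding variables to the recognized sub-strings can be illustrated with an ordinary regular expression library. The sketch below uses Python's standard re module with named groups. It is not the language of this invention; the pattern names, the sample response, and the extract_items function are assumptions chosen for illustration, and backtrack free recognition is not shown.

```python
import re

# Small reusable sub-patterns, composed into a larger item pattern.
NAME = r"<b>(?P<name>[^<]+)</b>"
PRICE = r"\$(?P<price>\d+(?:\.\d{2})?)"
ITEM = re.compile(NAME + r"\s*-\s*" + PRICE)


def extract_items(response: str):
    """Bind 'name' and 'price' for every item recognized in a response."""
    return [
        {"name": match.group("name").strip(),
         "price": float(match.group("price"))}
        for match in ITEM.finditer(response)
    ]


sample = "<li><b>Editor Pro</b> - $49.95</li><li><b>Tiny Editor</b> - $9</li>"
items = extract_items(sample)
```

Here the dictionary built for each match plays the role of an action performed with several variable bindings at once.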

4. BRIEF DESCRIPTION OF THE DRAWINGS

These and other features, aspects, and advantages of the
present invention will become better understood by reference
to the accompanying drawings, following description, and
appended claims, where:

Fig. 1 illustrates generally a netbot of this invention;

Figs. 2A-B illustrate exemplary user interfaces of
embodiments of the netbot of Fig. 1;

Fig. 3 illustrates exemplary functional components of
the netbot of Fig. 1; and

Fig. 4 illustrates alternative hardware embodiments of
the netbot of Fig. 1.

5. DETAILED DESCRIPTION

For clarity of disclosure, and not by way of limitation,
the detailed description of a netbot of this invention is
presented as a method or process for network information
access, as a system or apparatus implemented to perform that
method, and as a novel language designed to assist in
implementing this method and system.
In the following, first an overview of these aspects of
the invention is presented followed, second, by a detailed
discussion of individual components.



5.1. OVERVIEW OF NETBOT ARCHITECTURE
A netbot method or system of this invention comprises
software and hardware facilities that function together in
one or more network attached computers to assist a user in
access to information present in network attached servers
(known herein as "information sources"). Fig. 1 generally
illustrates the relationships of a netbot to a user and to
networked information sources. For example, user 1 accesses
user computer 3 through standard interface devices, such as
monitor 2. In the course of work, the user needs information
from information sources 7, attached to the user computer
through various network links, such as network links 4 and 6.
Since the information sources are many, the user can benefit
from assistance in finding needed information from relevant
information sources. This assistance is provided by netbot
5, which maintains an awareness of available information
sources and queries them through links 6 on behalf of, or as
an agent of, the user. Alternatively, netbot 5 can partly or
wholly reside on user computer 3 or be partially or wholly
distributed on the network and accessed by the user through
link 4.

Groups of sources 7 having similar sorts of information
are grouped into conceptual classes called information
domains. For example, one domain can be that of electronic
stores for a particular product; another domain might include
Internet indexes containing information on the keyword
content of various World Wide Web ("WWW") pages.

In a preferred embodiment, the netbot is composed of
three major functional modules: a user interface, an
integrator, and an I/O manager. Briefly, the user interface
module interacts with the user to receive user queries for
information, and to format and present information responses
received from the network attached information sources.
Advantageously, the user interface is adapted to the specific
information domain being accessed. The integrator module
accepts a user query from the user interface module, selects
relevant information sources, formats the query for network
transmission to each relevant information source, receives
responses from these sources, understands these responses,
and passes the relevant portions of the responses back to the
user interface module for display to the user. The I/O
manager module performs hardware, operating system, and
network specific interfacing for the user interface and
integrator modules so that porting a netbot to different
hardware platforms, operating systems, or networks requires
changes only in well-modularized code.

In particular alternative implementations, either or
both of the user interface or the I/O manager modules may be
absent. For example, the functions of these modules might be
already performed by other operating system components. A
netbot can provide only one or more of the facilities
disclosed without providing others. For example, in some
embodiments, a netbot is useful that performs query routing,
alone or in combination with relevance ranking. In other
embodiments, a netbot can simply format queries and
understand responses. Further, additional modules may be
present in or accessed by a netbot. For example, a learning
module can be present which provides for a netbot to acquire
automatically the characteristics of a new network
information source. Finally, as is known to those of skill
in the art, the functions performed by the described modules
may be divided or grouped in alternative fashions among a
greater or lesser number of modules.

In the absence of specific preferences, the processes of
this invention can be implemented in a complete procedural
programming language, such as C, or a complete object
oriented programming language, such as C++, on the disclosed
hardware configurations.

THE USER INTERFACE MODULE

In more detail, the user interface module has both
important functionality that is common to netbot user
interfaces, whatever the information domain to which netbots
are directed, and also has adaptations to the particular
information domain of a particular netbot. Turning first to
the preferable common functions, one such is the ability to
remember a user's preferences for interacting with a netbot.
Such remembered preferences include, for example, whether the
netbot is to fetch pages in order to calculate their
relevance itself, the number N of relevant information
sources to query, display characteristics, etc. A second
such common function is incremental display of information
received from a query. Since each query usually causes a
netbot to consult many independent information sources,
results are often received at widely varying times.
Immediate display of asynchronously received results would
cause such undesirable effects as screen flicker or
disorienting rearrangement of already displayed data.
Incremental display accommodates, on the one hand, user
desires to view information quickly with, on the other hand,
user desires for a comprehensive view of all received
information, sorted according to relevance.

The incremental display strategy preferably provides one
or more windows on the user's screen with several defined
views of the query satisfaction process along with certain
common user controls, such as screen buttons, for
manipulating these windows. In one window, the user
interface module presents lists of the information sources
being consulted with each source symbolically represented as,
for example, a network address, an icon, or another compact
screen representation. Then, as an answer is received from a
particular source, its associated screen representation
changes appearance, e.g., by changing intensity, changing
color, etc. Also displayed is a count of the total number of
unique information items received currently. Optionally,
clicking on the screen representation of an information
source opens a further window with either information about
that information source, or a display of the responses
received from it, or access to the information source over
the network, etc.

One of the common controls is preferably implemented as
a common show-me-button that when activated causes display of
another window in which a list of all currently received
responses is presented ranked according to their estimated
relevance to the query. Another common control is preferably
implemented as a more-button in this latter window that, when
activated, causes re-display of all prior data items merged
with items newly received since the last window display. The
newly received items are merged into the display list in
order of relevance, and distinguished from prior items by,
for example, their color or intensity so that the user can
avoid scanning prior items again. Optionally, clicking on a
data item opens a still further informational window giving
either the source of the item, a display of the response
containing the item, or access over the network to the source
of the item, etc.
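The more-button behavior described above can be sketched as a small buffer that holds newly arrived items until the user asks for them and then merges them into the displayed list in relevance order. The class and method names below are hypothetical, and items are reduced to a relevance value plus a label for brevity.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class IncrementalDisplay:
    """Hypothetical model of the show-me / more-button display list."""
    shown: List[Tuple[float, str]] = field(default_factory=list)
    pending: List[Tuple[float, str]] = field(default_factory=list)

    def receive(self, relevance: float, item: str) -> None:
        """Buffer an asynchronously received item without redrawing."""
        known = {label for _, label in self.shown + self.pending}
        if item not in known:  # keep only unique information items
            self.pending.append((relevance, item))

    def unique_count(self) -> int:
        """Total number of unique items received so far."""
        return len(self.shown) + len(self.pending)

    def more(self) -> List[Tuple[str, bool]]:
        """Merge pending items by relevance; flag the newly merged ones."""
        new_labels = {label for _, label in self.pending}
        self.shown = sorted(self.shown + self.pending,
                            key=lambda pair: -pair[0])
        self.pending = []
        return [(label, label in new_labels) for _, label in self.shown]


display = IncrementalDisplay()
display.receive(700, "page A")
display.receive(300, "page B")
first_view = display.more()
display.receive(900, "page C")
display.receive(700, "page A")  # duplicate, ignored
second_view = display.more()
```

The boolean flag in each returned pair stands for whatever visual distinction (color, intensity) the interface applies to newly received items.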
In addition to such common functions and controls, a
netbot user interface module preferably implements specific
designs, formatting, and fields suitable to the information
domain for which it is designed. For example, a netbot for
comparison shopping in an information domain of electronic
stores on the Internet can have a particular interface
presentation containing labeled fields for product name,
[illegible] and price. On the other hand, a netbot for access to an
information domain of online Internet indexes can have an
interface with fields labeled with elements derived from the
[illegible].

In one netbot embodiment, most functions and modules
reside on network attached servers which a user accesses
remotely. For example, the user may access a netbot over the
Internet with the World Wide Web protocols utilizing a web
browser, such as Netscape. In this case the user interface
builds HTML formatted pages which are transmitted over the
network by the I/O manager. Fig. 2A generally illustrates
the user display from an example of such an embodiment, which
is further directed to the information domain of
electronic software stores. The netbot display of Fig. 2A is
divided into three sections. Section 11 is a title section
generally indicating that this display has results from a
netbot for shopping, a "ShopBot." A [illegible] netbot preferably also
has a specific input query screen. Section 12 presents the
list of online stores currently being consulted, represented
[illegible] further information or direct WWW access. [illegible] in section
[illegible] those sources which have already returned query results
[illegible] similarly represented. Section 13 [illegible]
received so far, formatted in accordance with this particular
information domain into sections for the major PC operating
systems. Each individual item [illegible]
is formatted with product name, price, and an [illegible]
originating information source. In this implementation,
information display is controlled with the window scrolling
and control facilities built into the web browser. The
interface is implemented as an HTML formatted page created by
a netbot server and transmitted to the web browser.

Fig. 2B generally illustrates the user display from
another example of a web-browser-based user interface
embodiment, which in this case is directed to an information
domain of WWW indexes or search engines. This display is
also generally divided into three sections. Section 71
displays a title for the netbot; section 72 displays the
status of the current search; and section 73 displays the
search results. In more detail, the display of section 71
includes a logo for this netbot, "MC" standing for
"MetaCrawler," a name chosen since WWW search engines are
also known as "web crawlers," and controls to access certain
system level presentation features, such as the MetaCrawler home
page and user feedback pages. The display of section 72
includes list 74 of the search engines being queried
identified by their common names, the status of the current
query in general and at each search engine, and common user
controls. Generally, pie-chart icon 78 summarizes that 7 of
the 8 search engines queried have already responded to the
query. At search engine 75, known as "Lycos," the check mark
indicates that a response containing information items has
already been received. At search engine 76, known as
"Inktomi," the cross mark indicates that a response without
any information items has already been received. On the
other hand, search engine 77, known as "Galaxy," is visibly
distinguished from the other search engines to indicate that
it has not yet responded to the query. The common controls
of section 72 include more-button 79 to request the display
of newly arrived search results, and modify-search-button 80
to request that a new or modified query be sent. Lastly, the
display of section 73 includes the information items returned
from the search engines. Each information item is displayed
separately and includes title 81, descriptive text 82 if
available, and line 83 with the URL of the web page for this
item and the estimated relevance of this item to the query,
here "1000." The items are sorted for display by descending
values of the estimated relevance. The displayed items are
scrolled using controls provided by the web browser. This
user interface is formatted as a Java applet downloaded
from a netbot server and executed by the web browser. In
this manner, the interface offers a greater degree of
interactivity than that of Fig. 2A. For example, it can query
the netbot server for current search status and update the
status displays accordingly without user action.

Although the user interface is described primarily in
terms of windows and buttons, one of skill in the art will
recognize that this invention is adaptable to other display
[illegible] that provide for display of information and input
of user commands. For example, the user interface module can
control the entire screen and present graphical displays
without intervention of a windowing system.

The user interface module is preferentially [illegible]
object oriented programming language supplemented
[illegible] class library providing windowing functions. A
preferable implementation uses Java
with the java.awt package. See, for example,
[illegible], O'Reilly & Associates, sections 5 and
[illegible].

THE INTEGRATOR MODULE

[illegible] preferred functional modules,
data [illegible] and their functional [illegible]
in general and of an integrator module [illegible] in particular.
[illegible] functions: a
[illegible].
These components are introduced here and
[illegible] in the following [illegible].
[illegible] delivered [illegible]
the netbot in order of relevance and to return the N
[illegible]. Next the integrator retrieves the N
wrappers for the N most relevant sources from wrapper
database 40. These wrappers, which are [illegible]
information source and its requirements, are written
[illegible]. The
retrieved wrappers are used by the [illegible], first,
to format the query into forms recognized by each information
source, and second, to understand the information returned by
each information source 33 in order to eliminate extraneous
formatting matter and to put the received information into a
common format. This formatted information is then aggregated
and passed to the user interface module for presentation 32
to the user according to preferred incremental display method
36. The user display is also controlled by stored user
preferences 35.

THE QUERY ROUTER 

A well-behaved netbot should preferably use scarce
network bandwidth and information source processing resources
in a competent, efficient, and frugal manner, while at the
same time best answering each user query. Such behavior
minimizes resource usage, and thus achieves best overall
performance, both for the individual netbot and for all
netbots functioning simultaneously on a network.

The query router is important to achieving this behavior
because it permits the netbot to send requests only to
information sources likely to have information relevant to a
query. From a user query, the query router determines the
relevance of each information source to the given query and
returns the N most relevant sources. N is a parameter
controllable at user preference and can be as small as 1.
This relevance determination is preferably over inclusive
rather than under inclusive. Occasionally including an
irrelevant information source is preferable to missing a
relevant and important source. Further, it is preferable
that this relevance determination be quick to compute, not
requiring costly processing techniques.

In a preferred embodiment, the query router calculates a
numerical relevance rank value for each information source
that estimates the source's relevance. This calculation is
based on the concept of "conceptual classes." Thus each
information source is tagged in advance with the conceptual
classes for which it is relevant. Then the query router maps
each query to the conceptual classes relevant to it, and
[illegible] information sources with conceptual classes shared with
the query. The mapping of a query to its conceptual classes
is preferably done with a hash function.
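A minimal sketch of such a router follows. The word-to-class table, the source tags, and the ranking rule (number of conceptual classes shared between query and source) are assumptions made for illustration; the text does not specify them. A Python dict, which is a hash table, stands in for the hash function that maps a query to its conceptual classes.

```python
from typing import Dict, List, Set

# Hypothetical tables. Sources are tagged in advance with conceptual classes.
WORD_CLASSES: Dict[str, Set[str]] = {
    "editor": {"software"},
    "windows": {"software", "operating-system"},
    "novel": {"books"},
}
SOURCE_CLASSES: Dict[str, Set[str]] = {
    "software-store": {"software", "operating-system"},
    "book-store": {"books"},
    "general-index": {"software", "books"},
}


def query_classes(query: str) -> Set[str]:
    """Map a query to its conceptual classes by hashed word lookup."""
    classes: Set[str] = set()
    for word in query.lower().split():
        classes |= WORD_CLASSES.get(word, set())
    return classes


def route(query: str, n: int) -> List[str]:
    """Return up to n sources sharing at least one class with the query."""
    wanted = query_classes(query)
    ranked = sorted(
        ((len(wanted & tagged), name)
         for name, tagged in SOURCE_CLASSES.items()),
        key=lambda pair: (-pair[0], pair[1]),  # rank first, then name
    )
    return [name for rank, name in ranked if rank > 0][:n]


top_two = route("Windows editor", 2)
```

Keeping every source with a non-zero rank, rather than only exact matches, reflects the stated preference for an over inclusive determination that is cheap to compute.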

THE AGGREGATION ENGINE

The aggregation engine is the coordinating function of
the integrator module. It receives the user query from the
user interface module and requests the query router to
provide a list of the N information sources most relevant to
the query. [illegible] retrieves the N wrappers for the N
most relevant sources from the wrapper database. Guided by the
wrappers, the aggregation engine translates the query into the
request formats accepted by each of the N information
sources and transfers these requests to the I/O manager for
network transmission. For some sources, the query may be in
the format of a form to be returned. When a response is
received from an information source, the aggregation engine,
again guided by the wrappers, [illegible] called
tuples, [illegible]
domain. Optionally, each tuple can be assigned a priority
order using a method appropriate to the particular
information domain. Finally, when the incremental display
[illegible] requests data to present to the user, perhaps in
response to a more-button request, the aggregation engine
passes the tuples to the user interface module, sorted in
priority order if a priority is determined.

[illegible]
software version number, operating system required, price,
etc. An exemplary priority order of the tuples can be by
price, by delivery delay, or [illegible] preference.
For a further example, if the information domain relates
to World Wide Web ("WWW") search engines, which index
information pages available generally on the WWW, then the 
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tuples optionally contain such relevant fields as title of 
the indexed page, the universal resource locator ("URL") of 
that page, relevance scores estimating the relation of the 
indexed page to the query, descriptive text, etc. An 
exemplary priority order can be based on the netbot's
normalized estimate of the relevance of the indexed page to
the query. If the netbot does not retrieve the indexed page,
then it sums the normalized relevance estimates for this
response that are returned from each of the search engines.
If a search engine does not return a relevance estimate, a
default value is used. The obtained relevance estimates are
then normalized by linearly adjusting the returned scores to
have a common maximum of, e.g., 1000, and then multiplying
the adjusted scores by a confidence factor. This confidence
factor, ranging from 0 to 1, is a predetermined estimate of
the reliability of a particular information source's own
relevance estimates. For example, it can be determined by
practical experience with the information source's relevance
estimates. Alternately, the user can request the netbot to
retrieve the page in order to do its own relevance estimate.
In an exemplary embodiment, for queries requesting the
presence either of all query words or of any query words, the
estimate is determined by scanning the page and counting the
number of query words actually present, and then scaling the
count so that the presence of all words results in the common
maximum relevance value. For queries requesting the presence
of a phrase, the estimate is determined, for example, by
subtracting from the common maximum a normalized sum of the
square of the distance in the page of each word of the phrase
from its successor word in the phrase. Thereby, if the
phrase appears contiguously in the page the relevance is
high, whereas if the words of the phrase are widely separated
on the page, the relevance is low.
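
By way of non-limiting illustration, the page-scanning relevance estimates described above can be sketched in Python as follows. The tokenizer, the use of the first occurrence of each word, the score of zero when a phrase word is missing, and the normalization constant are all assumptions of this sketch; the text specifies only the common maximum (e.g., 1000) and the general form of the two estimates.

```python
import re

MAX_RELEVANCE = 1000.0  # the "common maximum" of the text (e.g., 1000)


def tokenize(text):
    """Split text into lower-case word tokens (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def word_relevance(page_text, query_words):
    """Scale the count of query words present so all present gives the maximum."""
    wanted = [w.lower() for w in query_words]
    if not wanted:
        return 0.0
    page_words = set(tokenize(page_text))
    present = sum(1 for w in wanted if w in page_words)
    return MAX_RELEVANCE * present / len(wanted)


def phrase_relevance(page_text, phrase_words):
    """Subtract a normalized sum of squared word-to-successor distances."""
    tokens = tokenize(page_text)
    first_pos = {}
    for i, tok in enumerate(tokens):
        first_pos.setdefault(tok, i)  # first occurrence only (assumption)
    wanted = [w.lower() for w in phrase_words]
    if not wanted or any(w not in first_pos for w in wanted):
        return 0.0  # assumption: a missing word gives the minimum score
    # A gap of 0 means the successor word immediately follows.
    gaps = [first_pos[b] - first_pos[a] - 1 for a, b in zip(wanted, wanted[1:])]
    if not gaps:
        return MAX_RELEVANCE
    worst = len(gaps) * len(tokens) ** 2  # assumed normalization constant
    penalty = MAX_RELEVANCE * sum(g * g for g in gaps) / worst
    return max(0.0, MAX_RELEVANCE - penalty)
```

With this sketch a contiguous phrase scores the common maximum, and the score falls as the words of the phrase move apart on the page.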

In summary, for the majority of information domains, the
priority order is determined from a relevance computation, as
in the WWW search engine example. However, for certain
domains such as online software vendors, a priority order can
be simply determined from the values of one or more numeric
fields of the response tuples.

IfflE DATABASE emi ™ s 

The preferred manner of describing information sources
and their capabilities, in particular their query formats and
response formats, is with compact, modular, declarative
descriptions called wrappers. Since a netbot can access from
several hundred to many thousands of information sources, the
descriptions of the sources are preferably compact, requiring
a minimum of storage. Further, since new information sources
are frequently created and existing sources frequently change
their format, easy maintenance of the descriptions is
important. A modular, declarative description, instead of a
complex procedural description, facilitates such maintenance.
In one embodiment of this invention, wrappers can be learned
by a separate module for information sources having [...]

In an exemplary embodiment,
each wrapper advantageously includes the following
information:

1. The Universal Resource Locator ("URL") address of the
information source;
2. The conceptual classes of the source;
3. A description of the mapping from query arguments, e.g.
words or phrases, to fields of the query or HTML defined
form used to interrogate the source (including site
support for any, all, phrase, or proximity queries);
4. A description of the format of the query response or
HTML page layout that enables parsing of relevant
information from other information and extraneous
formatting matter.

At least items 3 and 4 are advantageously written in the
wrapper description language of this invention.

A netbot can retrieve wrappers in various manners. In
one embodiment, the information source itself can supply its
own wrapper upon request from a netbot. In an alternative
embodiment, the netbot can provide its own wrappers in
various manners. For example, wrappers can be built in the
netbot itself, especially where the netbot accesses only a
few information sources. Also, wrappers can be stored in a
local database or can be downloaded on demand from a
centralized database.
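
By way of non-limiting illustration, a wrapper record holding the four items listed above, together with a lookup that tries built-in wrappers, then a local database, then an on-demand download, can be sketched in Python as follows. The names Wrapper, find_wrapper, and download are hypothetical, and caching a downloaded wrapper locally is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass(frozen=True)
class Wrapper:
    url: str                           # item 1: address of the source
    concept_classes: Tuple[str, ...]   # item 2: conceptual classes
    query_mapping: str                 # item 3: WDL text for forming queries
    response_format: str               # item 4: WDL text for parsing responses


def find_wrapper(
    url: str,
    built_in: Dict[str, Wrapper],
    local_db: Dict[str, Wrapper],
    download: Optional[Callable[[str], Optional[Wrapper]]] = None,
) -> Optional[Wrapper]:
    """Look in built-in wrappers, then the local database, then download."""
    for table in (built_in, local_db):
        if url in table:
            return table[url]
    if download is None:
        return None
    wrapper = download(url)  # e.g., a request to a centralized database
    if wrapper is not None:
        local_db[url] = wrapper  # cache locally (assumption)
    return wrapper
```

Plain dictionaries stand in here for whatever data management system stores the compact textual descriptions.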

The wrapper description language (hereinafter referred
to as the "WDL") of this invention facilitates the semantic
description of queries, forms, and pages by using a
declarative description format that combines features from
grammars and regular expressions. Here an example of this
description language is presented. A detailed description is
set forth in a later section. Syntax used follows
conventions known to those of skill in the art for specifying
grammars, including regular expressions. See, e.g.,
Schwartz, 1993, Learning Perl, O'Reilly & Associates, Inc.,
chapter 7; Aho et al., 1986, Compilers: Principles,
Techniques, and Tools, Addison Wesley Publishing Co., section
3.3.

An exemplary description in WDL of a typical page
returned from a WWW search engine follows here. The WDL
interpreter uses the page description to parse a page and to
execute any specified action statements. Note that "stuff"
is a reserved word in the WDL that matches any character
string up to the first occurrence of a mandatory following
string literal.

<page> ::= stuff "<dl>" <item>* .*$

<item> ::= stuff "<dt>" stuff "=\"" (stuff) "\"><strong>"
           (stuff) "</strong></a>" stuff "<dd>" (stuff)
           "<br>" { output($0, $1, $2, 500) }

This describes a page made up of, among other data, a
sequence of zero or more items. In detail, a page is
specified to consist of stuff, then the string <dl>, then
zero or more items, then zero or more characters (".*"), then
finally the end of the page ("$"). In general, an item
includes three fields of relevance, denoted by (stuff) and
referred to serially by $0, $1, and $2, that when an item
is recognized are output by the "output()" statement. In
detail, an item is specified to consist of [...]
the first field of interest, later referred to by $0, then the
string "><strong>, then the second field of interest, later
referred to by $1, then the string </strong></a>, then more
stuff, then the string <dd>, then the third field of
interest, [...] then the string <br>.
[...] The action statement that follows the pattern for
an <item> is executed when each item is
recognized.
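
By way of non-limiting illustration, the effect of the exemplary page description can be approximated with an ordinary regular expression library, here in Python. This is a sketch and not the WDL interpreter: "stuff" is approximated by a non-greedy match, which can backtrack past the first occurrence of the following literal where the WDL would not, and the literal strings are taken from the item description as reconstructed from the garbled source. The function name parse_page is hypothetical.

```python
import re

STUFF = r".*?"  # shortest run of characters before the next literal


def lit(text):
    """Escape a string literal for use inside the pattern."""
    return re.escape(text)


# <item> ::= stuff "<dt>" stuff "=\"" (stuff) "\"><strong>"
#            (stuff) "</strong></a>" stuff "<dd>" (stuff) "<br>"
ITEM = re.compile(
    STUFF + lit("<dt>") + STUFF + lit('="')
    + "(" + STUFF + ")" + lit('"><strong>')
    + "(" + STUFF + ")" + lit("</strong></a>")
    + STUFF + lit("<dd>")
    + "(" + STUFF + ")" + lit("<br>"),
    re.DOTALL,
)


def parse_page(page):
    """<page> ::= stuff "<dl>" <item>* .*$ ; returns the output tuples."""
    start = page.find("<dl>")
    if start < 0:
        return []
    pos = start + len("<dl>")
    tuples = []
    while True:
        match = ITEM.match(page, pos)
        if match is None:
            break  # the remaining text is consumed by ".*$"
        # Mirrors the action { output($0, $1, $2, 500) }.
        tuples.append((match.group(1), match.group(2), match.group(3), 500))
        pos = match.end()
    return tuples
```

Each returned tuple carries the three matched fields and the constant 500, as in the output statement of the description.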

" iH^mas^tsua atructur . of . „ethot can be 

The preferred functional structure of a netbot can be
assigned to system hardware components in various
manners. The preferred alternative in any case depends
[...] allocation of function achieves a rapid response and
[...] cost. Fig. 4 generally illustrates exemplary
netbot hardware embodiments and options in view of the
previous general description. It illustrates the
interrelationship of user [...] a processor,
[...] various attached peripherals. The peripherals
include display device 52, or other device [...]
computer 51 [can] be a PC or a workstation running one of the
[...] operating systems, the Macintosh operating system, or
[...] The local netbot software implements one or
more of the netbot functions. The local system components
can include, for example, a web browser.

Network 57 can be any network with a plurality of
attached information sources 58, which can optionally be
conceptually classified by subject matter into information
domains. In a preferred embodiment, network 57 is the public
Internet or a private intranet supporting the TCP/IP suite of
protocols, including such user level protocols as FTP, HTTP,
and so forth. The information sources are server computers
which make their stored information available using the
protocols supported by network 57. Such information can
include databases of periodicals, newspapers, etc.,
information on or produced by particular commercial,
educational, or other types of organization, facilities for
electronic commerce, etc.

In such a network, a netbot can have various
embodiments. In an entirely local embodiment, all netbot
functions reside in local netbot software 55 on user computer
51, which in this embodiment must have sufficient processing
and storage capabilities. In alternative embodiments, one or
more of the disclosed netbot functions can be distributed on
other network attached computers.

For example, computer 59 is a wrapper server for
accepting requests for downloading wrappers from its wrapper
database. The wrapper database can be stored in memory or on
disk using any data management system capable of storing and
retrieving compact textual descriptions. Computer 60 is a
query server for performing query routing by accepting
queries and returning the N most relevant information sources
from the many tens or hundreds of thousands about which it
stores information. Computer 61 is a netbot server for
performing the integrator module function by accepting user
queries and returning search results, perhaps using the
facilities of wrapper server 59 and query server 60. With
these network servers, local netbot software preferably only
supports the user interface, which may be delegated entirely
to a web browser. Alternately, it can further include the
aggregation engine, which makes query routing requests to
query server 60 and wrapper requests to wrapper server 59.
Further, it can include one or both of these latter [...]

The various computers of a netbot system can be provided
with software for performing the methods of this invention
[...] network. This invention is adaptable to known [...]
optic media, such as disks, tapes and CD-ROM.

5.2. THE I/O MANAGER

The I/O manager module performs hardware, operating
system, and network specific interfacing for the integrator
module. Network interfacing includes [...]
requests and receiving responses from network
information sources. An important application of the netbots
[...] relevant protocols of [...]

[...] interfacing includes the task of window
management for the user interface module and access to the
[...]

Preferably, the I/O manager [...]
commercially available protocol stacks, windowing libraries,
such as the Java.awt package, and other tools. In some
implementations, more or less of the I/O manager functions
can be performed by other system components on the network
attached computer. Optionally, the I/O manager is designed
to be scalable to multiple machines, to not require
threaded or reentrant code, and to be cross platform and
persistent.



5.3. THE AGGREGATION ENGINE

In the preferred embodiment the functions previously
identified for the aggregation engine component of this
invention are performed by the following process. Searches
of the information sources proceed in parallel because all
requests are transmitted without waiting for any responses.

1. Receive a user query from the user interface module;

2. Perform query routing to determine the N information
   sources most relevant to the user query;

3. For each of the N relevant information sources, do:
   A. Retrieve the wrapper for this information source
      (for example, from the source itself or from a local or
      remote wrapper database);
   B. Guided by the wrapper, format the user query into
      the form or format required by the information source;
   C. Transfer the translated command to the I/O module
      for transmission to the information source;

4. Initialize the list of responses to be empty;

5. Until a user specified time limit is reached, do:
   A. When an information source response has been
      received by the I/O module and transferred to the
      integrator, then:
      i. Guided by the wrapper for this information
         source, parse the response to understand the
         information returned, discard the site-specific
         formatting text and other irrelevant matter, and
         gather relevant fields into tuples;
      ii. Add each tuple to the list of tuples,
         optionally performing priority ranking, duplicate
         elimination, etc.;
   B. Wait for the next response;

6. Deliver the list of created tuples to the user interface
   module on request, which can be due, for example, to
   user activation of the show-me-button or more-button
   controls.
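
By way of non-limiting illustration, steps 3 through 5 of the above process can be sketched in Python with a thread pool, so that every request is sent before any response is awaited. Query routing and wrapper retrieval are assumed to have already produced, for each source, three hypothetical callables (format_query, fetch, parse) that stand in for the wrapper and the I/O module; skipping a source that fails is also an assumption of this sketch.

```python
import concurrent.futures as cf


def aggregate(query, sources, time_limit_s):
    """Query all sources in parallel and merge tuples until the time limit.

    sources: iterable of (format_query, fetch, parse) callables per source.
    """
    sources = list(sources)
    merged = []  # step 4: the list of responses starts empty
    pool = cf.ThreadPoolExecutor(max_workers=max(1, len(sources)))
    try:
        # Step 3: transmit every request without waiting for any response.
        pending = {
            pool.submit(fetch, format_query(query)): parse
            for format_query, fetch, parse in sources
        }
        try:
            # Step 5: handle responses as they arrive, up to the time limit.
            for done in cf.as_completed(pending, timeout=time_limit_s):
                try:
                    response = done.result()
                except Exception:
                    continue  # a failed source is skipped (assumption)
                for tup in pending[done](response):
                    if tup not in merged:  # duplicate elimination
                        merged.append(tup)
        except cf.TimeoutError:
            pass  # time limit reached; keep what has arrived so far
    finally:
        pool.shutdown(wait=False)  # do not block on slow sources
    return merged
```

Priority ranking of the merged tuples is left out here; the returned list corresponds to what step 6 delivers to the user interface module.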
When multiple information sources are queried, it is
preferable in step 6 to present to the user interface module,
and thereby to the user, a single merged list of tuples
extracted from the responses and sorted according to an
estimate of the significance or relevance of each tuple to
the user. Such an estimate is preferably made according to
methods specific to the information domain to which the
netbot is directed. For certain domains, a significance
estimate can be made directly from the value of one or more
fields in the tuples. For example, [...]
electronic shopping, significance can be related only to
price or delivery date, according to user preference. For
most domains, however, a significance estimate is made
according to relevance of the returned information, which
must be determined by examining the responses from each
source.

In a preferred embodiment of such a relevance
determination, the user has the choice of whether or not to
have the netbot examine all information pages itself. If the
user [...], the relevance is determined by the netbot
according to a domain specific Analyze function. In a domain
[...] queries, an exemplary Analyze function finds
the number and location of query words [...]
For keyword queries, responses with [...]
greater frequency are more relevant. For phrase
queries, responses having the words of the phrase more
closely spaced, for example in [...]
contiguous sequence, are more relevant. In other [...]
appropriate Analyze functions are provided.

[...] netbot examine the
responses, the netbot relies on relevance estimates returned
from the information sources. If a particular source does
not return a relevance estimate, a default value is used.
These estimates are then normalized to be between, e.g., 0
and 1000, and multiplied by a confidence factor.
This confidence factor, ranging from 0 to 1, is a
predetermined estimate of the reliability of a particular
information source's own relevance estimates. For example,
it can be determined by practical experience with the
source's estimates. Where the same tuple is returned from
two or more sources, the relevance values from all those
sources are combined. Optionally, the relevance estimates
returned from each source are adjusted to have a uniform
distribution in the normalization range.

In a particular detailed embodiment, this determination
is performed according to the following process. Here, query
routing has determined a list of K information sources,
source_k with k from 1 to K, and returned their confidence
ranks, crank_k. Each of these sources has been queried and has
returned responses, and K lists of information tuples,
tuples_k, each of length length_k, have been extracted
from these responses. The user's preference for netbot
analysis is recorded in verification flag V. The variable
t.score represents the composite relevance score for tuple t;
the variable t.sourcescore_k represents the relevance
estimate returned from information source_k for the response
that tuple t was extracted from.



Input:  List of K sources with their confidence rank pairs
        (source_k, crank_k), obtained from the query
        routing system; K ordered lists of tuples, tuples_k, of
        length length_k, obtained from source
        source_k; and the verification flag V (Boolean),
        obtained from user preferences
Output: Merged list of all tuples sorted by relevance.

/* Main routine */
IF (V is true) THEN
    FOREACH k = 1..K
        FOREACH tuple t in tuples_k
            page = the HTML page that tuple t points to,
                   downloaded if necessary
            t.score = Analyze(page);
ELSE /* V is false */
    FOREACH k = 1..K
        NormalizeScores(tuples_k)
        AdjustByHeight(tuples_k)
        AdjustByServiceRanking(tuples_k)

    /* Merge result tuples, t, into MERGEDLIST and
       determine a composite relevance score, t.score,
       [...] returned by the information sources,
       [...] same tuple returned from multiple
       sources has its composite score incremented by the
       source score from each source */
    FOREACH k = 1..K
        FOREACH tuple t in tuples_k
            IF t is not in the MERGEDLIST THEN
                Add t to the MERGEDLIST
                t.score = t.sourcescore_k
            ELSE
                t.score = t.score + t.sourcescore_k
            ENDIF
ENDIF
SORT all tuples t by t.score and discard duplicates
OUTPUT sorted tuples
EXIT /* finished relevance ranking */

/* Subroutines */

SUBROUTINE NormalizeScores(tuples_k)
/* If information source k returns relevance estimates,
   normalize them to fall in the range from 0 to 1000;
   otherwise, use a default relevance estimate */
/* "s" is the k'th information source's relevance score
   for the first tuple on the list of tuples from source k */
    s = tuples_k[1].sourcescore_k
    IF (s == 0) THEN
        /* this information source returns no scores,
           therefore use default */
        FOREACH tuple t in tuples_k
            t.sourcescore_k = 1000
    ELSE
        scaling_factor = 1000.0 / s;
        FOREACH tuple t in tuples_k
            t.sourcescore_k = t.sourcescore_k * scaling_factor;
    ENDIF
ENDSUB

SUBROUTINE AdjustByHeight(tuples_k)
/* Adjust the source scores to have a uniform percentile
   distribution; for example for 10 tuples, the first tuple's
   source score is adjusted to 100% of its source score, the
   second tuple's source score is adjusted to 90% of its source
   score, etc. */
    percent_step = 100 / length_k;
    percent_off = 0;
    FOREACH tuple t in tuples_k
        t.sourcescore_k = t.sourcescore_k * (100 - percent_off) / 100;
        percent_off = percent_off + percent_step;
ENDSUB

SUBROUTINE AdjustByServiceRanking(tuples_k)
/* [...] source score is scaled by 0.9 */
    FOREACH tuple t in tuples_k
        t.sourcescore_k = t.sourcescore_k * crank_k
ENDSUB
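
By way of non-limiting illustration, the branch of the above process in which V is false can be sketched in Python as follows. The SourceResult record and the representation of each tuple as a (key, source score) pair are assumptions of this sketch; the three adjustment functions follow the subroutines NormalizeScores, AdjustByHeight, and AdjustByServiceRanking.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SourceResult:
    crank: float                     # confidence rank of the source, 0 to 1
    tuples: List[Tuple[str, float]]  # (key, source score), in returned order


def normalize_scores(scores, maximum=1000.0):
    """Scale so the first (top) score equals the maximum; default if none."""
    if not scores:
        return []
    top = scores[0]
    if top == 0:
        return [maximum] * len(scores)  # source returned no scores
    return [s * maximum / top for s in scores]


def adjust_by_height(scores):
    """First tuple keeps 100% of its score, later tuples evenly less."""
    n = len(scores)
    return [s * (n - i) / n for i, s in enumerate(scores)]


def adjust_by_service_ranking(scores, crank):
    """Scale every score by the source's confidence rank."""
    return [s * crank for s in scores]


def rank(results):
    """Merge all sources; a tuple seen in several sources sums its scores."""
    merged = {}
    for res in results:
        scores = normalize_scores([score for _, score in res.tuples])
        scores = adjust_by_height(scores)
        scores = adjust_by_service_ranking(scores, res.crank)
        for (key, _), score in zip(res.tuples, scores):
            merged[key] = merged.get(key, 0.0) + score
    return sorted(merged.items(), key=lambda item: item[1], reverse=True)
```

The branch in which V is true would replace the three adjustments with a call to a domain specific Analyze function on each downloaded page.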



One of skill in the art will recognize that these
processes are amenable to routine alterations and
enhancements that perform the same functions in the same
manners. In particular, other values for the normalization
range and default value for information source relevance
estimates can be used. This invention includes such routine
alterations. These processes are preferably implemented in
C++, but can alternatively be implemented in any procedural
or object-oriented programming language.

5.4. THE QUERY ROUTER
The query router receives as input a user query
[...] a list of words or keywords and returns as
[...] likely relevance to the [...]
information sources is optimized for [...]



source.

The preferred query router is based on the principle of
assigning relevant concepts to information sources and query
words. In advance, a set of concepts is chosen to describe
the information sources of the one or more information
domains to which one or more netbots are directed. For each
information source in the domains, the relevance of that
source to each of the chosen concepts is judged. Further,
each word that can appear in possible queries is examined to
determine which of the chosen concepts are relevant to the
word. Then, upon receiving the words or keywords of a query,
the concepts associated with these words are determined, and
then the information sources relevant to these concepts are
found. The ranked relevance of each source is determined by
combining the individual relevances of the source to all the
concepts of the query. The case of phrase based queries is
preferably handled by generating separate data for this query
type.

The preferred implementation of this process utilizes
four tables containing relevance information. In the
following, W is chosen to be somewhat bigger, e.g. 10%, than
the number of words that can appear in possible queries; C is
the number of chosen concepts; and S is the number of
information sources in the information domain.
WORD2CONCEPT[] is a table of W vectors of C bits, where the C
bits of the vector for a word indicate which of the C
concepts are relevant to that word. CONCEPT2SOURCE[][] is a
C by S table. For each of the C concepts and S sources, the
corresponding entry of this table contains the relevance
value of that source to that concept. For example, if entry
<i, j> equals 5, the j-th information source has a relevance
weight of 5 with respect to the i-th concept.
CONCEPT2SOURCE[][] is used when searching by words. For
searching by phrases, the table CONCEPT2PHRASE[][] similarly
relates concepts to sources. Finally, DEFAULT-RELEVANCE[]
has a default relevance weight for each of the S information
sources.

The preferred implementation performs the following
process:

1. For each of the S information sources, set RELEVANCE[j]
   = DEFAULT-RELEVANCE[j].
2. For each word in the user query, do:
   a. Compute a hash function on the word, obtaining a
      number M [...]. Any suitable hash
      [...] Addison-Wesley Publishing Co., chapter 16.
   b. Set the C bit vector V equal to WORD2CONCEPT[M];
   c. Combine [...] all the concepts in V [...]
      the relevance for the information sources by performing:
      For i from 0 to C, do:
          If (i-th bit of V is '1') THEN
              FOR j = 1 to S DO
                  RELEVANCE[j] = RELEVANCE[j] +
                                 CONCEPT2SOURCE[i, j]
      A monotonically increasing function other than "+" can
      be used to combine the individual concept
      relevances into a final [...]
   d. Combine the relevances for all the words in the
      user query, for example by adding them together.
3. In the case of [...] additionally do:
   a. Concatenate all the words of the user query phrase;
   b. Compute the hash function on the phrase,
      obtaining M, and set the C bit vector V equal to
      WORD2CONCEPT[M];
   c. Combine [...] concepts in V to [...]
      the relevance for the information [...]
      For i from 0 to C, do:
          If (i-th bit of V is '1') THEN
              FOR j = 1 to S DO
                  RELEVANCE[j] = RELEVANCE[j] +
                                 CONCEPT2PHRASE[i, j]
4. Sort information sources based on their RELEVANCE, and
return the N most relevant sources.

One of skill in the art will recognize that this process is
amenable to routine alterations and enhancements that
perform the same functions in the same manner. This
invention includes such routine alterations.

This process is preferably implemented in C++, but can
alternatively be implemented in any procedural or
object-oriented programming language. In the case of a query router
which maintains information on a large number, e.g., tens of
thousands, of information sources, the query router is
preferably implemented as a server process on a server
computer to accommodate the size of the required data
structures and the processing requirements of query routing.

An exemplary construction of the table WORD2CONCEPT
begins with the selection of concepts to characterize the
information domain of interest and the determination of words
and phrases likely to occur in user queries. For each
concept, the following actions are performed. The words and
phrases associated with that concept or to which the concept
relates are assigned to the string arrays KEYS[] and
PHRASES[]. Then the following process is carried out.

FOR i equals 1 to the number of elements in KEYS[], DO
    Apply the previously used hash function to KEYS[i]
    to obtain a number M between 0 and W
    SET the bit matching the current concept in
    WORD2CONCEPT[M].
FOR j equals 1 to the number of elements in PHRASES[], DO
    Apply the previously used hash function to PHRASES[j]
    to obtain a number M between 0 and W
    SET the bit matching the current concept in
    WORD2CONCEPT[M].

These actions are repeated for each chosen concept.
Alternatively, concept information can be stored along with
the string information, using either open or closed hashing,
in order to preserve accurate string to concept matching.
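
By way of non-limiting illustration, the construction of WORD2CONCEPT and the query routing process described above can be sketched together in Python as follows. The choice of CRC-32 as the hash function, the use of Python integers as bit vectors, the joining of phrase words with single spaces, and the function names are assumptions of this sketch. Because distinct strings can hash to the same table entry, a word may pick up concepts that belong to another word, which is the inaccuracy the alternative of storing strings with the concept information avoids.

```python
import zlib


def word_hash(text, table_size):
    """Map a word or phrase to a table index (CRC-32 is an arbitrary choice)."""
    return zlib.crc32(text.lower().encode("utf-8")) % table_size


def build_word2concept(concept_keys, table_size):
    """concept_keys[i] lists the words and phrases tied to concept i."""
    table = [0] * table_size  # each entry is a C-bit vector held in an int
    for i, keys in enumerate(concept_keys):
        for key in keys:
            table[word_hash(key, table_size)] |= 1 << i
    return table


def route(query_words, word2concept, concept2source, default_relevance,
          n, concept2phrase=None):
    """Return the n most relevant (source index, relevance) pairs."""
    relevance = list(default_relevance)  # step 1
    size = len(word2concept)

    def accumulate(text, weights):
        bits = word2concept[word_hash(text, size)]
        for i, row in enumerate(weights):
            if (bits >> i) & 1:  # concept i is relevant to this text
                for j, weight in enumerate(row):
                    relevance[j] += weight

    for word in query_words:  # step 2
        accumulate(word, concept2source)
    if concept2phrase is not None:  # step 3, phrase queries only
        accumulate(" ".join(query_words), concept2phrase)
    ranked = sorted(enumerate(relevance), key=lambda p: p[1], reverse=True)
    return ranked[:n]  # step 4
```

Addition is used to combine relevances here; as noted above, another monotonically increasing function could be substituted.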
5.5. THE WRAPPER DESCRIPTION LANGUAGE
This section first presents introductory material on the
use of the WDL in wrappers. Next, the two principal
components of the WDL - the action language and the regular
expression language - are described in detail. Finally, an
exemplary embodiment of the WDL is presented.

[...] description of an information source and
how a netbot should interact with it, in particular how to
[...] of a description of an information source using a
[...] representation encoded in the [...] Currently,
for example, netbots need to format requests [...] a source and
to parse useful information from the pages returned by [...]
source while ignoring irrelevant formatting information. As
information sources become more functional, netbots will need
[...] simple request- [...]

[...] example, in C++. This is important because new information
sources are created [...] services change
their format frequently. Optionally, wrapper descriptions
can be automatically generated using machine learning
techniques in information domains where responses have a
regular, predetermined format.

2) Wrapper descriptions are small. This is important so
[...]
[...] sources can supply their own wrappers on request.

3) WDL descriptions can be automatically compiled into fast
finite state automata that quickly parse the information
returned by information sources.

4) Using wrappers with WDL permits netbots to adapt to new
types of information formats and new types of information
server interactions in the future.

The wrapper description preferably specifies at least
two processes: first, requesting information from an
information source, e.g., how to fetch the appropriate HTML
formatted page from a source; and second, how to parse
returned pages to extract the relevant data. To perform the
first process, the WDL includes an action language component,
which is an extensible language of expressions and
statements. To perform the second process, the WDL includes
a regular expression language component, which is an
extensive and novel means of specifying regular expression
pattern matching facilities.

In alternative implementations, netbots can utilize
alternative pattern matching facilities known to those of
skill in the art. For example, the regular expression
component could be replaced with a context free language
specification ("CFL"). In this case, implementation of the
WDL can follow techniques known for construction of compiler
compilers, for example YACC. However, where possible,
regular expression pattern matching is preferred because of
its straightforward specification and rapid execution.

An Example Of The Regular Expression Component

Regular expressions are advantageous for describing the
format of the information returned from many information
sources. The regular expression component of the WDL
augments prior regular expression matching facilities in
several novel manners. First, it permits programming
language facilities of the action language, e.g. statements
and expressions, to be executed when regular expressions are
recognized and with variable bindings as determined by
partial matches recognized during the overall recognition.
Second, the preferred implementation compiles regular
expressions into compact and efficient finite state automata.
Third, it encourages the efficient and intuitive expression
of complex regular expressions in a nested manner. And
fourth, it possesses an efficient backtrack-free [...]

An exemplary wrapper follows which is [...] an
internet information search engine and is written in the WDL.
The significance of each portion is explained in comments,
which are surrounded by "/*" and "*/".

/* A list of input query words is passed to the wrapper
from the aggregation engine, which is executing the
wrapper for the information source. The argv() function
of the action language extracts a list of the input
query words. */
{ $keywords = argv(2);

/* The request to be sent to the information source is
calculated by concatenating, indicated by the "."
[...] of the action language, three separate strings:
one, the internet URL address of the information source and
the initial query format string; two, the query words to
search; and three, the remainder of the query format
string for this information source. */
$url = "http://searcher.source.com/search[...]query="
       . $keywords . "[...]";

/* The fetch action statement transfers this query to the
I/O module for network transmission, and then waits for
the HTML formatted response. */
$page = fetch(0, $url, "");

/* The HTML formatted response text is parsed using the
following regular expression grammar. */
$result = parse($page, <page>); }

/* A response from this exemplary source consists of a
page containing zero or more information items. This is
hierarchically described in the regular expression
language by expressions for <page> and <item>, where
<page> refers to <item>s. In particular, a page
includes a chunk of text followed by the string "results
returned...", then followed by zero or more items. */
<page> ::= stuff 'results returned, ranked by'
           <item> * END

/* Each item consists of irrelevant text, HTML
formatting codes, and relevant data fields. The
relevant data fields are enclosed in parentheses and
referred to sequentially by $0, $1, and $2. These
fields, along with the number 500, are output as a tuple
when the "output" statement of the action language is
executed upon recognition of an <item>. The particular
meaning of the definition of <item> will become apparent
in the next section. */

<item> ::= stuff '<hr>' stuff '<center><b><a href="'
           (stuff) stuff '>' (stuff) '</a>'
           stuff '<font size=-1>' (stuff) '</font>'
           { output($0, $1, $2, 500); } END

25 

5.5.1. DESCRIPTION OF THE WDL COMPONENTS

This section describes the preferred features of the
action language and the regular expression components of the
WDL. This invention is not limited to the descriptions
presented. These descriptions make use of known methods for
describing grammars, including regular expressions, and one
of skill in the art will recognize that there are many
substantially equivalent descriptions. Such equivalent
descriptions include, but are not limited to, those resulting
from renaming the described syntactic elements or from
applying known grammatical transformations to the presented
syntax. This invention comprehends also these substantially
equivalent descriptions.

[...] level features: an assignment statement; sequencing
statements such as "if" and "while" constructs; a tuple
output statement; [...]
numeric, and boolean operators; and certain built-in
[...] It will be apparent to those of skill in the art
that a base-level action language of equivalent semantic
capability can be constructed with slightly different
[...] features. For example, the while statement can be
[...] by a [...] statement. The action language
of this invention comprehends such known equivalent
[...]

The syntax of the preferred base-level action language
is given by the following grammar expressed in a standard
notation.



Statement ::=
      $VariableName "=" Expression ";"
    | "if" Expression Statement [ "else" Statement ]
    | "while" Expression Statement
    | "{" Statement* "}"
    | "output" "(" [ Expression { "," Expression }* ] ")" ";"

Expression ::=
      string-constant
    | float-constant
    | $VariableName
    | Expression op Expression
    | "(" Expression ")"
    | functionName "(" [ Expression { "," Expression }* ] ")"

15 

Thus, the action language component is defined in terms of
statements and expressions. A statement can be: an assignment
statement, which assigns a new value to a scalar variable; an
if statement; a while loop; a compound statement, which is a
sequence of other statements enclosed by "{" and "}"; or an
output statement. Except for the output statement, the
statements function in the same manner as in other procedural
programming languages, e.g., C, C++, or Pascal. For the if and
while statements, the conditional argument is considered true
if it is either a non-zero number or a non-null string, i.e.,
not the empty string "". The output statement allows a wrapper
to return information matched in responses from the
information source to the netbot module executing the wrapper.
For example, executing the statement "output(arg_1, ...,
arg_n)" causes the tuple <arg_1, ..., arg_n> to be returned
from the wrapper to the netbot.

An expression can be: a string constant, which is a symbol
string surrounded by quotation marks; a floating point number;
a variable name, which is a name or an integer preceded by a
dollar sign; an infix operator applied to two expressions; a
parenthesized expression; or a function call. The language
provides the following operators: arithmetic operators ("+",
"-", "*", "/"); numeric comparison operators ("==", "!=", "<",
"<=", ">", ">="); string comparison operators ("eq", "ne",
"lt", "le", "gt", "ge"); a string concatenation operator; and
boolean operators ("!", "&&", "||"). These operators have the
following meanings:

     1) +, -, *, /: perform the indicated arithmetic operation
        on the surrounding expressions;
     2) [operator symbol illegible in source]: concatenate the
        surrounding strings;
     3) ==, !=, <, <=, >, >=: perform the indicated numeric
        comparison of the surrounding numbers and return the
        floating point values 0.0 or 1.0 accordingly;
     4) eq, ne, lt, le, gt, ge: perform a character-by-character
        comparison of the ASCII codes of the surrounding strings
        and return the floating point values 0.0 or 1.0
        accordingly;
     5) !: return 1 if the argument is the number 0 or the
        string ""; otherwise return 0;
     6) &&: evaluate the first argument; if the first argument
        is zero or "", stop and return that value; else evaluate
        the second argument and return the latter;
     7) ||: evaluate the first argument; if the first argument
        is neither zero nor "", stop and return that value
        immediately; else evaluate the second argument and
        return the latter.

The operators have the following precedence, listed from
highest to lowest:

     1. [...]
     2. *, /
     3. [...]
     4. ==, !=, <, <=, >, >=, eq, ne, lt, le, gt, ge
     5. ! (boolean not)
     6. && (boolean and)
     7. || (boolean or) (lowest precedence)

All operators are left-associative.

No variable declarations are needed. First, all variables in
the action language are distinguished by being preceded by a
"$". Those special variables denoting substrings matched by
regular expressions consist of a "$" followed by an integer,
e.g., $0 or $1. Second, at run time, automatic type conversion
between string and floating point types is done dynamically.
If an operator expects a number but gets a string, the string
is converted to a number by calling the C library function
atof(), which converts the ASCII representation of a number
into its internal floating-point representation. If an
operator expects a string argument but gets a number, it uses
the C library function sprintf(..., "%f") to convert it to a
string. Third, referenced variables that have not yet been
assigned are set to the default values 0 or "" as appropriate.
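
The conversion and truth rules described above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the helper names to_number, to_string, and is_true are hypothetical, and the regular expression merely approximates the prefix-parsing behavior of the C library function atof() for ordinary decimal input.

```python
import re

# Approximation of the numeric prefix that atof() accepts for decimal input:
# optional sign, digits with an optional fraction, optional exponent.
_NUMERIC_PREFIX = re.compile(r"\s*[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?")


def to_number(value):
    """Coerce an action-language value to float; non-numeric text yields 0.0."""
    if isinstance(value, float):
        return value
    match = _NUMERIC_PREFIX.match(value)
    return float(match.group(0)) if match else 0.0


def to_string(value):
    """Coerce an action-language value to a string, as sprintf(..., "%f") would."""
    return value if isinstance(value, str) else "%f" % value


def is_true(value):
    """A condition holds for a non-zero number or a non-empty string."""
    if isinstance(value, float):
        return value != 0.0
    return value != ""
```

Under these rules the string "0" counts as true because it is a non-empty string, while the number 0.0 and the empty string both count as false.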

The action language component has several built-in functions.
The following are preferred base-level built-in functions.

1) argc(), argv(): When a netbot executes a wrapper, it can
pass the wrapper one or more arguments. These usually
represent query parameters, query words, or query keywords
supplied by the user. These arguments are accessed from within
the wrapper language by the functions argc(), which returns
the number of arguments passed to the wrapper, and argv(n),
which returns the n-th argument.

2) fetch(): This function preferably interfaces with the I/O
module to transfer a string containing the network address of
an information source and perhaps query parameters over a
network to the addressed information source according to the
proper protocol, and returns a string containing the response
of the information source. Wrappers use this function to query
information sources and retrieve pages.

3) parse(<string>, <nonterminal>): This function takes a
string and attempts to match it to the regular expression
corresponding to the given <nonterminal>, as defined in the
regular expression language component of the WDL. This
function returns "1" if the match was successful, or "0" if
the string did not match the regular expression. Wrappers use
this function to parse the response of an information source.

An exemplary wrapper includes the following sequence of action
language statements. First is a series of statements using
argc() and argv() to obtain the query parameters and to
initialize a string variable, e.g., $url, with a string
containing the URL of the appropriate page at the information
source together with an appropriately formatted query. Next,
an assignment statement that calls fetch($url) fetches the
query-response page into another string variable. Finally, the
function parse() matches that string against the regular
expression, <page>, which describes the sought page.

The Regular Expression Component of the WDL

The regular expression ("reg-exp") component of the WDL
matches strings against regular expressions. It has been found
that regular expressions are convenient to describe the format
of responses returned from a wide range of information
sources. However, the reg-exp language of the WDL is capable
of matching this information so that relevant fields can be
extracted in a more rapid and more convenient manner than can
prior languages and systems such as AWK and Perl. The regular
expression component includes novel facilities that solve the
problems with prior matching systems.

A first novel facility permits the specification of regular
expressions to be broken down into smaller components. See
[...], Addison Wesley Publishing Co., section 4.2. Writing a
single regular expression to represent the format of a page
from an information source, as is required in prior systems,
often results in a very large and cumbersome expression, one
which is difficult to write, understand, and maintain. To
solve this problem of existing systems, the reg-exp component
specifies a regular expression by a set of rules for
components of the regular expression. These components are
labeled by nonterminals. However, in contrast to context-free
grammars, the set of rules in the reg-exp component is not
allowed to be recursive or mutually recursive. In other words,
the rule for a particular nonterminal cannot directly or
indirectly refer to other rules which refer to that particular
nonterminal.

The following exemplifies the use of a set of rules and
nonterminals. A top-level nonterminal defining an information
response can be:

     <page> ::= <head> <item>* <tail> END

which specifies that the response is a page consisting of a
head, followed by zero or more items, followed by a tail. The
keyword "END" denotes the end of a rule. The second-level
nonterminals on the right-hand side ("RHS") of this rule,
<head>, <item>, and <tail>, are defined by their own rules:

     <head> ::= "Results of your search: \n" END
     <item> ::= "Data: ...\n" END
     <tail> ::= "No more results\n" END

To execute these rules, the reg-exp component compiler
substitutes into the RHS of <page> the RHSs of the rules for
<head>, <item>, and <tail>. The result is as if the wrapper
contained the large, cumbersome composite top-level rule:

     <page> ::= "Results of your search: \n" ("Data: ...\n")*
                "No more results\n" END

If the second-level rules had contained further nonterminals
on their RHSs, the compiler would continue making appropriate
substitutions until there are no more nonterminals on the RHS
of the composite rule for the top-level rule. Because of the
lack of recursion or mutual recursion, this substitution
terminates.
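
The substitution of rule bodies into the top-level rule can be sketched in Python as follows. This is a minimal illustration, not the compiler of the preferred embodiment: the tuple encoding of rule items ("lit", "nt", "star", "seq") and the function names are assumptions made for the example. The sketch rejects recursive rule sets, which is the property that guarantees termination.

```python
def expand_item(rules, item, active):
    """Return `item` with every nonterminal replaced by its rule body."""
    kind, value = item
    if kind == "nt":
        if value in active:
            # A rule that refers to itself, directly or indirectly, is rejected.
            raise ValueError("recursive rule: <%s>" % value)
        body = [expand_item(rules, sub, active + (value,)) for sub in rules[value]]
        return ("seq", body)
    if kind == "star":
        return ("star", expand_item(rules, value, active))
    return item  # a literal string needs no substitution


def expand(rules, top):
    """Build the composite rule for the top-level nonterminal `top`."""
    return expand_item(rules, ("nt", top), ())


# The example rules for <page>, <head>, <item>, and <tail>.
rules = {
    "page": [("nt", "head"), ("star", ("nt", "item")), ("nt", "tail")],
    "head": [("lit", "Results of your search: \n")],
    "item": [("lit", "Data: ...\n")],
    "tail": [("lit", "No more results\n")],
}
composite = expand(rules, "page")
```

After the call, composite holds only literals, sequences, and repetitions, mirroring the composite top-level rule shown in the text.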
A second novel facility permits skipping groups of characters
in a string without backtracking. In prior regular expression
matching systems, this requirement is implemented in a manner
that requires the storage of many backtracking points on a
stack and can lead to extensive backtracking. For example, the
prior art Perl idiom ".*", which matches any number of
occurrences of any character, or the Perl idiom "[^\d]*\d",
which matches all non-digit characters up to the first
occurrence of a digit, can cause extensive backtracking during
a match. This is inefficient and not preferable, since
responses from information sources should be rapidly parsed.
See, e.g., [...], O'Reilly & Associates Inc., chapter [...].
To solve this problem, the reg-exp component introduces simple
syntax for this common idiom:

     stuff "literal-string"

where "stuff" is a reserved word. stuff is defined to match
all characters from the current character up to, but not
including, the first occurrence of the following
"literal-string," which must be a string literal. This
construct allows a compact and efficient implementation of
this common and important idiom.
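
The behavior of the stuff construct can be sketched with a single forward scan, here using Python's built-in str.find. The function name match_stuff and its return convention are assumptions for illustration only; the sketch shows the semantics of skipping to the first occurrence of the literal, with no backtracking points recorded.

```python
def match_stuff(text, pos, literal):
    """Skip from `pos` to the first occurrence of `literal`.

    Returns (skipped_text, position_after_literal), or None when the
    literal does not occur at or after `pos`.
    """
    start = text.find(literal, pos)  # one left-to-right scan
    if start < 0:
        return None
    return text[pos:start], start + len(literal)


# Skip everything between "Data: " and the newline that ends the item.
result = match_stuff("Data: 42\nrest", 6, "\n")
```

Here result is the pair ("42", 9): the skipped text excludes the literal, and the returned position lies just past it.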
A third novel facility allows relevant data fields to be
extracted from arbitrarily complex regular expressions. Action
language statements can be embedded in the definition of a
nonterminal for execution whenever that nonterminal is matched
and with the variable bindings current at each occurrence of a
match of that nonterminal. In the case of the previous
example, the definition for <item> can be extended to embed an
output($0) action statement. Whenever <item> is matched, $0 is
bound to the text that followed "Data:" and preceded the
newline character.

Prior systems, for example AWK and Perl, do not have such a
facility. Although they do set variables, such as $0, $1,
etc., they do so only after matching the single composite
regular expression defining <page>. In other words, the entire
page is matched before variables are set. Therefore, in AWK
and Perl, if there is more than one <item> on the <page>, the
relevant data fields in all but the last <item> are lost. The
reg-exp component of the WDL solves this problem by allowing
specification of actions that are executed multiple times,
once for each nonterminal match, with different variable
bindings each time.
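
The loss of all but the last field can be reproduced with Python's re module, which is used here only as a stand-in for the prior systems named above; it is not the WDL matcher. The sample page text and the helper parse_items are assumptions for the example. The second half emulates an embedded action by invoking a callback once per matched item.

```python
import re

page = ("Results of your search: \n"
        "Data: a\nData: b\nData: c\n"
        "No more results\n")

# One composite expression: a capturing group inside a repetition retains
# only the text of its final iteration once the whole page has matched.
composite = re.compile(
    r"Results of your search: \n(?:Data: (.*)\n)*No more results\n")
last_only = composite.fullmatch(page).group(1)


def parse_items(text, action):
    """Emulate an embedded action: call `action` once for each item matched."""
    for match in re.finditer(r"Data: (.*)\n", text):
        action(match.group(1))


rows = []
parse_items(page, rows.append)
```

With this input, last_only holds only the final field, while rows collects every field in order, which is the effect the embedded output statement is described as achieving.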

Turning now to the description of the reg-exp language, it
includes certain preferred base-level features: definition of
nonterminal regular expressions; embedding action language
statements in regular expression rules; operators for
expressing alternation and repetition; and literal string
match. It will be apparent to those of skill in the art that a
preferred base-level reg-exp language of equivalent semantic
expressive abilities can be constructed with a slightly
different choice of features, and the reg-exp language
component of the WDL includes such well-known equivalents. In
particular, the reg-exp language comprehends variable
renamings and known grammatical transformations applied to the
rules below.

In optional embodiments, the preferred base features can be
augmented with such additional features as: a special
disjunction ("||"); repetition for an arbitrary integer;
matching to character classes; local string memory; and
anchoring characters. Such additional features are standard
with the exception of the special disjunction, which causes
the listed alternatives to be matched simultaneously as the
string is parsed, the first alternative to match being
returned. In contrast, the regular disjunction matches each
alternative in the sequence listed, backtracking until success
is found. The additional features can be added to the
base-level language in known manners. See, e.g., Schwartz,
1993, [...], O'Reilly & Associates, Inc.; [...] et al., [...],
Addison Wesley Publishing Co.; Hopcroft et al., [...].

In a further optional embodiment, [...] could control
backtracking in order to improve the performance of the
regular expression matching. If [...] of a rule were known to
require no backtracking, [...] of the rule to be matched more
efficiently than otherwise.

The syntax of the preferred base-level reg-exp language is
given by the following grammar expressed in a standard
notation.



     Rule       ::= <nonterminal> "::=" Regexp "END"

     Regexp     ::= Sequence ( "|" Sequence )*

     Sequence   ::= Repetition*

     Repetition ::= Term "?"
                  | Term "*"
                  | Term "+"
                  | Term

     Term       ::= String_In_Double_Quotes
                  | "stuff"
                  | "(" Regexp ")"
                  | <nonterminal>
                  | Action

     Action     ::= "{" Statement* "}"

Briefly, a rule specifies a particular <nonterminal> to
recognize a particular regular expression. A regular
expression is built from: alternatives; sequences; zero or one
repetitions ("?"), zero or more repetitions ("*"), and one or
more repetitions ("+"); literal strings enclosed by double
quotation marks; the special symbol stuff; parenthesized
regular expressions; nonterminals, which must be defined in
another rule; or statements written in the action language.

The Wrapper Description Language

A complete wrapper description can be:

     WrapperDescription ::= Statement Rule*

Thus the preferred WDL entities include a statement in the
action language, which is typically a sequence of action
language statements, followed by an optional set of rules in
the reg-exp language, which define a regular expression for
matching a response returned from information sources.

To execute a complete wrapper description, the statement is
executed as is described subsequently. Typically, the
statement contains calls to the built-in functions fetch() and
parse(), among others. The parse() built-in function attempts
to match a response returned by fetch() by invoking a
nonterminal defined by the appended rules, each of which
defines a regular expression. If the regular expression match
is successful, all the action language statements embedded in
the regular expression are executed. Typically, some action
statements embedded in lower-level nonterminals are executed
multiple times with the operand bindings current at each
occurrence of a match. The parse() function then returns the
value "1". If the match is unsuccessful, none of the embedded
statements are executed, and the parse() function returns the
value "0".

5.5.2. IMPLEMENTATION OF THE WDL COMPONENTS

The preferred implementation of the WDL is described in this
section under the following headings: (1) parsing regular
expression rules, (2) intermediate code generation for regular
expressions, (3) run-time interpretation of regular
expressions, (4) action language code generation and run-time
interpretation. It is understood that this preferred
implementation of the WDL is exemplary. It is known to those
of skill in the art that alternatives exist to





the implementation described herein. For example, the
processes described can be implemented differently, with
different variables and different orderings of the individual
steps. Also, the processes may implement alternative
algorithms to achieve the same effects. For example, the
described [...] discloses processes of interpreting various
[...] backtracking stack [...] could be used. Finally, instead
of interpretation, it is possible to compile directly to a
machine language using the disclosed syntax-directed methods.

In the remainder of this section, a parse tree node that has
label "l", contains data "d", and has child nodes c_1, ...,
c_n is denoted by "l<d; c_1, ..., c_n>". These parse tree
nodes can be constructed and referenced in ways known in the
art, e.g., by a pointer to a data area containing node data.

The preferred implementation of [...] the regular expression
to match [...] and additional parse trees for each lower-level
rule.

In a preferred embodiment, the process of this step is
performed by parsing according to a recursive descent parser
[...] according to a syntax-directed [...]. The construction
of a recursive descent compiler for the previously described
syntax of the reg-exp language is well known to those of skill
in the art. For example, it is clearly disclosed with examples
in such textbooks as Aho et al., 1986, Compilers Principles,
Techniques, and Tools, Addison Wesley Publishing Co. at
sections 2.4 and 4.4. Syntax-directed construction of parse
trees is covered with examples in section 5.2. This invention
is not limited to this preferred embodiment. Alternative
parsing techniques are known in the art, and this invention
comprehends embodiments using such techniques, such as LL or
LR parsing. See generally Aho et al. at chapter 4. This
invention also comprehends alternative techniques of
intermediate code generation known in the art. See generally
Aho et al. at chapter 5.

The specification for the syntax-directed analysis includes
rules for the parse-tree nodes to be created when each syntax
rule is recognized by the parser. For each of the previous
reg-exp syntax rules the following nodes, labeled by the
nonterminal of the rule, are created:

1. Rule: Create the node "Rule<nonterminal-name;
node_for_regexp>." This is the node for an entire rule labeled
with its nonterminal name.

2. Regexp: Create the node "Alternatives<node_for_sequence_1,
..., node_for_sequence_n>." This is the node for a disjunction
("|") of alternative regular expressions, which are tested for
a match in the order listed, the first successful match being
returned.

3. Sequence: Create the node "Sequence<node_for_repetition_1,
..., node_for_repetition_n>." This is the node for a sequence
of regular expression patterns each of which must sequentially
match.

4. Repetition: Create the node "Repetition<type;
node_for_term>." This is the node for a repetition, where type
is one of "?" (0 or 1 repetition), "+" (1 or more
repetitions), or "*" (0 or more repetitions).

5. String_In_Double_Quotes: Create the node
"String<literal_string>." This is the node to match a literal
string.

6. stuff: Create the node "Stuff<>" for the reserved word
"stuff."

7. (Regexp): Create the node "Sequence<OpenParenthesis<n>,
node_for_regexp, CloseParenthesis<n>>." This is the node which
causes assignment to a sequentially numbered variable on a
match by "Regexp." The node "OpenParenthesis<n>" is a node
representing the n'th open parenthesis that is encountered in
sequentially parsing the RHS of the current rule. The node
"CloseParenthesis<n>" is a node representing the corresponding
n'th close parenthesis.

8. <nonterminal>: Create the node
"Nonterminal<nonterminal_name>," for an instance of a
nonterminal.

9. Action: Create the node
"Action<node_for_action_language_statement>." This is the node
for an instance of an action language statement in a rule. The
node_for_action_language_statement represents intermediate
code for the action language statement.

These rules, which are executed when the parser recognizes the
corresponding syntax rules, produce a parse tree. This tree is
input to subsequent steps of this process.




Code generation in the preferred embodiment includes the
following three steps:

     1. Special preprocessing of stuff nodes;
     2. Eliminating occurrences of nonterminals in the RHS of
        the top-level nonterminal; and
     3. Converting the resulting RHS of the top-level rule into
        a finite state automaton ("FSA").

These steps are, first, generally described, and second,
described in detail.

First, preprocessing of stuff nodes is necessary to later use
the preferred postorder traversal method of FSA generation. As
an FSA cannot be built for a stuff element alone, it is
necessary to build an FSA for the combination of a stuff node
and the literal string node that must follow each stuff
element. Only open and close parentheses can intervene between
a stuff node and its following literal string. To do so, the
parse tree is traversed and each stuff node is merged with the
following literal string by replacing both nodes with a new
Stuff_And_String node. The new node also accounts for any
intervening parentheses.

The second step in code generation substitutes all
nonterminals in the RHS of the top-level nonterminal with
their respective RHSs to generate a single composite rule for
the overall regular expression. During this substitution, it
is necessary to renumber occurrences of variables in action
language statements.
For example, consider the wrapper

     <page> ::= ("dave") ("bill") <a> ("dan") { output($2); } END
     <a> ::= "oren" (stuff "cody") { output($0); } END

Here, $2 in the output statement in <page> references
("dan"), while $0 in the output statement in <a> references
(stuff "cody"). After substitution, if variables in the output
statement were not renumbered, the RHS of <page> would be
expanded as follows:

     <page> ::= ("dave") ("bill") "oren" (stuff "cody")
                { output($0); } ("dan") { output($2); }

Here, however, $0 refers to ("dave") instead of (stuff
"cody"), while $2 refers to (stuff "cody") instead of
("dan"). To maintain correct variable assignment, the variable
references must be renumbered as follows:

     <page> ::= ("dave") ("bill") "oren" (stuff "cody")
                { output($2); } ("dan") { output($3); }

The third step in the preferred embodiment of the code
generation process converts the processed and expanded RHS
into an FSA by performing a postorder traversal of the parse
tree representing the RHS of the top-level rule, during which
each node is in turn converted into an FSA. The FSAs have
several types of states, most importantly:

     1. Char_branch: Branches to one of 256 possible successor
        states depending on the current character and advances
        the current read position;
     2. Action: Executes a block of action language statements;
     3. Marker: Records the current input string position, plus
        or minus a certain offset, on a mark stack;
     4. Success: When reached, parsing has succeeded;
     5. Fail: When reached, the current attempt at parsing has
        failed, and the FSA must either backtrack to the prior
        backtracking point or fail if no backtracking points
        are available;
     6. Push: Pushes a configuration, comprising an FSA state
        and the current input string position, onto the
        backtracking stack for later backtracking.

Code Generation Part 1: Preprocessing

The preferred pre-processing of stuff nodes begins by
expanding or flattening any nested sequences, i.e.,
Sequence<> elements nested within other Sequence<> elements.
This is done according to the following process:

     FOR each nonterminal PERFORM a POSTORDER traversal of its
     parse tree
          At each parse tree node v
               IF (v = Sequence<c_1, ..., c_n>) THEN
                    WHILE (there is an i, 1 <= i <= n, such that
                           c_i = Sequence<d_1, ..., d_m>) DO
                         Replace node v with a new node
                         Sequence<c_1, ..., c_i-1, d_1, ...,
                                  d_m, c_i+1, ..., c_n>

Next, each occurrence of a stuff node is replaced by a
Stuff_And_String node that also accounts for the following
literal string and any intervening parentheses. Every
occurrence of "stuff" in a wrapper must be followed by a
literal string with the semantic result that "stuff" matches
all characters from the current position up to, but not
including, the first occurrence of the following literal
string. Only open and close parentheses can intervene between
stuff and the literal string. In the preferred embodiment,
this replacement is performed according to the following
process:

     FOR each nonterminal PERFORM a POSTORDER traversal of its
     parse tree.
          At each node v of the parse tree
               IF (v = Sequence<c_1, ..., c_n>) THEN
                    WHILE (there is an i, 1 <= i <= n, such that
                           c_i = Stuff<>) DO
                         Let j be the smallest j > i such that
                              c_j = String<s> for some string s.
                         IF there is no such j THEN signal an
                              error.
                         ELSE IF (any element c_k (where
                              i < k < j) is NOT either an
                              OpenParenthesis or
                              CloseParenthesis) THEN signal an
                              error
                         /* c_j is the node for the string */
                         ELSE replace node v with a new node
                              Sequence<c_1, ..., c_i-1,
                                Stuff_And_String<c_j; c_i+1,
                                  ..., c_j-1>,
                                c_j+1, ..., c_n>

This process merges the stuff node, the literal string node,
c_j, and any intervening parentheses nodes, c_i+1, ...,
c_j-1, into the single node Stuff_And_String<the-string;
c_i+1, ..., c_j-1>.
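
The two preprocessing passes can be sketched in Python under simplifying assumptions. Parse tree nodes are encoded as tuples whose first element is the node label, the shortened labels "Open", "Close", and "StuffAndString" are hypothetical stand-ins for the node names in the text, and only Sequence nodes are descended into. This is an illustration of the described process, not the preferred embodiment itself.

```python
def flatten(node):
    """Splice nested Sequence nodes into their parent, children first."""
    if node[0] != "Sequence":
        return node
    out = []
    for child in (flatten(c) for c in node[1]):
        if child[0] == "Sequence":
            out.extend(child[1])   # lift the grandchildren into this level
        else:
            out.append(child)
    return ("Sequence", out)


def merge_stuff(node):
    """Replace each Stuff node and its following String by one node."""
    if node[0] != "Sequence":
        return node
    kids = [merge_stuff(c) for c in node[1]]
    out, i = [], 0
    while i < len(kids):
        if kids[i][0] != "Stuff":
            out.append(kids[i])
            i += 1
            continue
        j = i + 1
        # Only parenthesis nodes may sit between stuff and its literal.
        while j < len(kids) and kids[j][0] in ("Open", "Close"):
            j += 1
        if j == len(kids) or kids[j][0] != "String":
            raise SyntaxError('"stuff" must be followed by a literal string')
        out.append(("StuffAndString", kids[j][1], kids[i + 1:j]))
        i = j + 1
    return ("Sequence", out)


# Parse tree for: "oren" (stuff "cody")
tree = ("Sequence", [
    ("String", "oren"),
    ("Sequence", [("Open", 0),
                  ("Sequence", [("Stuff",), ("String", "cody")]),
                  ("Close", 0)]),
])
merged = merge_stuff(flatten(tree))
```

The merged node carries both the literal to search for and the list of intervening parenthesis nodes, so a later pass can emit markers for them.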
Having completed pre-processing, nonterminals on the RHS of
the top-level rule are preferably substituted and any action
statement variables renumbered. Substitution and variable
renumbering preferably use a process having two functions,
EXPAND_REGEXP and EXPAND_ACTION, and a global variable
PAREN_COUNT. EXPAND_REGEXP is called recursively to substitute
for RHS nonterminals, starting with a call to the top-level
nonterminal. PAREN_COUNT is initialized to 0 and, as
substitution proceeds, it is incremented for each "("
encountered. PAREN_COUNT thus has a count of the total number
of "("s encountered so far. Then during substitution,
parentheses encountered are renumbered from their prior number
to the current value of PAREN_COUNT and variables in action
statements are renumbered in a corresponding fashion. For
example, if PAREN_COUNT is currently 8, an OpenParenthesis<2>
node is replaced by an OpenParenthesis<8> node, and an
output($2) statement is replaced by an output($8) statement.
EXPAND_ACTION performs action statement variable renumbering.
These processes are preferably performed according to the
following:

     GLOBAL VARIABLE integer PAREN_COUNT;

     FUNCTION EXPAND_REGEXP(<nt> : nonterminal) : RETURNS a
               parse tree with renumbered variables
          /* an array of integers with new variable numbers */
          LOCAL VARIABLE new_names[]
          LET T = parse tree for the RHS of rule for <nt>
          PERFORM a PREORDER traversal of T.
          At each node v of tree T
               IF (v = OpenParenthesis<i>) THEN
                    new_names[i] = PAREN_COUNT;
                    REPLACE node v with
                         OpenParenthesis<PAREN_COUNT>
                    PAREN_COUNT = PAREN_COUNT + 1;
               ELSE IF (v = CloseParenthesis<i>) THEN
                    REPLACE node v with
                         CloseParenthesis<new_names[i]>
               ELSE IF (v = Nonterminal<x>) THEN
                    REPLACE node v with EXPAND_REGEXP(x)
               ELSE IF (v = Action<x>) THEN
                    REPLACE node v with EXPAND_ACTION(x,
                         new_names[])

     FUNCTION EXPAND_ACTION(T : parse tree for action language
               statement; new_names[]) : RETURNS a parse tree
               with renumbered variables
          PERFORM a PREORDER traversal of the parse tree T.
          At each node v,
               IF (v denotes variable name $i (i an integer))
               THEN
                    REPLACE node v with a new node with $i
                    replaced by $(new_names[i]).
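
The substitution and renumbering process can be exercised on the earlier <page> and <a> example with the following Python sketch. Several simplifications are assumptions of the sketch: each rule body is a flat left-to-right list of (label, value) pairs rather than a tree, action statements are kept as text and rewritten with a regular expression, and a one-element list stands in for the global PAREN_COUNT.

```python
import re


def expand_action(text, new_names):
    """Rewrite every $i (i an integer) in an action as $(new_names[i])."""
    return re.sub(r"\$(\d+)",
                  lambda m: "$%d" % new_names[int(m.group(1))], text)


def expand_regexp(rules, name, counter):
    """Inline nonterminals, renumbering parentheses and $n references."""
    new_names = {}  # local to this rule: old paren number -> new number
    out = []
    for kind, value in rules[name]:
        if kind == "Open":
            new_names[value] = counter[0]
            out.append(("Open", counter[0]))
            counter[0] += 1
        elif kind == "Close":
            out.append(("Close", new_names[value]))
        elif kind == "Nonterminal":
            out.extend(expand_regexp(rules, value, counter))
        elif kind == "Action":
            out.append(("Action", expand_action(value, new_names)))
        else:
            out.append((kind, value))
    return out


rules = {
    "page": [("Open", 0), ("String", "dave"), ("Close", 0),
             ("Open", 1), ("String", "bill"), ("Close", 1),
             ("Nonterminal", "a"),
             ("Open", 2), ("String", "dan"), ("Close", 2),
             ("Action", "output($2);")],
    "a": [("String", "oren"),
          ("Open", 0), ("StuffAndString", "cody"), ("Close", 0),
          ("Action", "output($0);")],
}
expanded = expand_regexp(rules, "page", [0])
actions = [value for kind, value in expanded if kind == "Action"]
```

For this input the two actions come out as output($2) and output($3), matching the renumbered composite rule given in the example above.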

Code Generation Part 3: Creating A Finite State Automaton

The final step in the preferred embodiment of code generation
creates an FSA representing the substituted and processed
regular expression. Although algorithms for creating such FSAs
are known, they have not in the past provided facilities, for
example, for embedding action language statements in such
FSAs. Accordingly, the following description focuses on those
new features of the preferred embodiment of this process that
are directed to supporting action language statements and the
other novel features of the reg-exp language.

An FSA output from this process starts its execution in an
initial state with an input pointer set to the start of the
input string and then repeatedly executes one of six basic
procedures according to its current state. These procedures
set a new current state and typically affect one or more of
the data structures accessed by them. The six types of states
and their procedures are as follows:

(1) Char_branch<next_state_0, ..., next_state_255>: When the
current state is a char_branch state, the next state is
selected according to the ASCII value, i, of the current input
character as next_state_i, and the input pointer is advanced
to the next character in the input string.

(2) Marker<num, open/close_flag, offset, next_state>: When the
current state is a marker state, a record is pushed on a
run-time stack known as the mark stack that contains the
character position of a parenthesis encountered in the input
string plus the indicated offset and an indication of whether
this parenthesis is open or close, and the next state is set
to next_state. An exemplary marker can push a record
indicating that the fifth open parenthesis occurred at the
current input position.

(3) Action<compiled_action_statements, next_state>: When the
current state is an action state, the compiled action language
statements are executed and the next state is set to
next_state.

(4) Success<>: When the current state is the success state, a
string has been successfully matched and this FSA terminates.

(5) Push<state, next_state>: When the current state is a push
state, the current configuration of the FSA is pushed onto a
run-time stack known as the backtracking stack, and the next
state is set to next_state. An FSA configuration is a record
containing identification of at least the current state
together with the current position in the input string, which
is used to support the non-deterministic constructs in the
regular expression language. An exemplary non-deterministic
construct is the disjunction "reg-expr1 | reg-expr2," which is
matched by an FSA that (a) pushes a backtracking state onto
the backtracking stack, (b) attempts to run the FSA
corresponding to reg-expr1, and, if that fails, backtracks and
(c) attempts to run the finite state machine corresponding to
reg-expr2.

(6) Failure<>: When the current state is the failure state, a
backtrack record is popped off the backtracking stack and the
next state and current input pointer are set to the contents
of the backtrack record. If the backtracking stack is empty,
the FSA has failed to match the input string and terminates.
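
The six state procedures can be sketched as a small Python interpreter. The encoding is an assumption made for illustration: states are tuples held in a list and referred to by index, a Char_branch state holds a dictionary from characters to next states in place of a 256-entry table (a missing entry behaves as a branch to the failure state), and marks or actions performed before a backtrack are not undone.

```python
def run_fsa(states, text, start=0):
    """Interpret an FSA; return (matched, mark_stack)."""
    state, pos = start, 0
    marks, backtrack = [], []
    while True:
        s = states[state]
        kind = s[0]
        if kind == "char":                 # ("char", {character: next_state})
            nxt = s[1].get(text[pos]) if pos < len(text) else None
            if nxt is not None:
                state, pos = nxt, pos + 1  # consume one input character
                continue
            kind = "fail"                  # no branch for this character
        if kind == "mark":                 # ("mark", num, flag, offset, next)
            marks.append((s[1], s[2], pos + s[3]))
            state = s[4]
        elif kind == "action":             # ("action", callable, next)
            s[1](marks)
            state = s[2]
        elif kind == "push":               # ("push", alternative, next)
            backtrack.append((s[1], pos))
            state = s[2]
        elif kind == "success":
            return True, marks
        else:                              # failure: backtrack or give up
            if not backtrack:
                return False, marks
            state, pos = backtrack.pop()


# FSA for the disjunction "ab" | "ac": try "ab" first, then back up.
alternation = [
    ("push", 3, 1),
    ("char", {"a": 2}),
    ("char", {"b": 5}),
    ("char", {"a": 4}),
    ("char", {"c": 5}),
    ("success",),
]
matched_ac, _ = run_fsa(alternation, "ac")
matched_ad, _ = run_fsa(alternation, "ad")
```

On the input "ac" the first alternative fails at its second character, the saved configuration is popped, and the second alternative succeeds from position 0. On "ad" both alternatives fail and the backtracking stack empties.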

Creation of the FSA for an input regular expression is done
using a postorder traversal of the previously produced parse
tree. As the parse tree is traversed, to each node v is
attached a finite state machine that matches the
sub-expression represented by the sub-tree rooted at v. This
attached finite state machine is the value of the variable
v.machine. The final FSA output is the FSA that is attached to
the root of the parse tree when the creation process
completes.

During the traversal, certain finite state machines are
created in accordance with methods already known to those of
skill in the art. These machines recognize standard regular
expression constructs and can be constructed by known methods.
General references for the creation of such standard finite
state machines include the following: Aho et al., 1974, The
Design And Analysis Of Computer Algorithms, Addison Wesley
Publishing Co., chapter 9; Aho et al., 1986, Compilers
Principles, Techniques, and Tools, Addison Wesley Publishing
Co., chapter 3; Hopcroft et al., 1979, Introduction to
Automata Theory, Languages, and Computation, Addison-Wesley
Publishing Co., Section 2.5; Sedgewick, 1990, Algorithms in C,
Addison-Wesley Publishing Co., chapter 16.

Further, the process disclosed adopts certain construction
methods that can advantageously be changed in alternative
embodiments. For example, the machine representing a literal
string can be constructed according to the Boyer-Moore or
other string matching algorithm. See, e.g., Sedgewick at
chapter 19.
5 The preferred process is: 

PERFORM a POSTORDER traversal of the tree.
At each node v, DO:
    IF (v=OpenParenthesis<n>) THEN
        SET v.machine = Marker<n, "open", 0, Success<>>
    ELSE IF (v=CloseParenthesis<n>) THEN
        SET v.machine = Marker<n, "close", 0, Success<>>
    ELSE IF (v=Action<node_for_the_action_language_statements>)
    THEN
        SET v.machine = Action<compiled_action_statements,
            Success<>>
    ELSE IF (v=String<the-string>) THEN
        SET len = length(the-string)
        FOR i = 1 to len DO
            branch_state[i] = Char_branch<s_0, s_1, ..., s_255>,
                where all the s_j's are Failure<> EXCEPT the
                s_j where j is the ASCII code for the i'th
                character of the-string, which is Success<>
        SET v.machine = MAKE_SEQUENCE(branch_state[1], ...,
            branch_state[len])
    ELSE IF (v=Stuff_And_String<the-string, par_1, ..., par_k>)
    THEN
        /* m_1 can be constructed according to known methods
           disclosed in the previously cited references */
        SET m_1 = a finite state machine that scans the input
            string for the first occurrence of the-string and
            stops as soon as it comes to the end of this first
            occurrence
        FOR i = 1 to k DO
            SET par_i.machine = marker state appropriate to
                parenthesis par_i and having an offset of
                -length(the-string)
        SET v.machine = MAKE_SEQUENCE(m_1, par_1.machine, ...,
            par_k.machine)
    ELSE IF (v=Sequence<element_1, ..., element_k>) THEN
        FOR i = 1 to k DO
            SET element_i.machine = machine constructed
                according to this process for element_i
        SET v.machine = MAKE_SEQUENCE(element_1.machine, ...,
            element_k.machine)
    /* a disjunction, "|" */
    ELSE IF (v=Alternatives<element_1, ..., element_k>) THEN
        FOR i = 1 to k DO
            SET element_i.machine = machine constructed
                according to this process for element_i
        SET v.machine = element_k.machine
        FOR i = k-1 downto 1 DO
            SET v.machine = Push<v.machine, element_i.machine>
    /* type is "?", "+", or "*" */
    ELSE IF (v=Repetition<type, element>) THEN
        SET element.machine = machine constructed according to
            this process for element
        SET v.machine = a finite state machine constructed
            according to methods disclosed in the previously
            cited references for this type of repetition of
            element.machine
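
The Alternatives case above can be sketched in Python as follows. This is an illustrative sketch only, not the disclosed implementation: the names Push, build_alternatives and trial_order are hypothetical, and plain strings stand in for the sub-machines. It shows that folding from element_k down to element_1 yields a chain in which the interpreter enters element_1 first and keeps the remaining alternatives available for backtracking.

```python
from collections import namedtuple

# Hypothetical stand-in for the Push<state, nextstate> state of the text:
# "alt" is the state saved for backtracking, "nxt" is entered immediately.
Push = namedtuple("Push", ["alt", "nxt"])


def build_alternatives(machines):
    """Fold the machines for element_1..element_k into a chain of Push states."""
    machine = machines[-1]                     # v.machine = element_k.machine
    for element in reversed(machines[:-1]):    # FOR i = k-1 downto 1
        machine = Push(alt=machine, nxt=element)
    return machine


def trial_order(machine):
    """List the sub-machines in the order in which they would be attempted."""
    order = []
    while isinstance(machine, Push):
        order.append(machine.nxt)   # entered first
        machine = machine.alt       # resumed only after a failure
    order.append(machine)
    return order


# Strings are placeholders for real sub-machines (an assumption of this sketch).
chain = build_alternatives(["m1", "m2", "m3"])
```

Here trial_order(chain) lists the placeholders in their original left-to-right order, which is the order in which a backtracking interpreter would attempt them.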

This process uses a MAKE_SEQUENCE function that builds a
composite machine from a series of sub-machines. The
composite machine executes each of the sub-machines in
sequence until one sub-machine fails, in which case the
composite machine also fails, or all the sub-machines
succeed, in which case the composite machine succeeds. In
other words, the composite machine runs the first
sub-machine; if that succeeds, it runs the second; if that
succeeds, it runs the third, and so on. If any sub-machine
fails, i.e., reaches a Failure<> state, then the composite
machine also fails.

FUNCTION MAKE_SEQUENCE (machine m_1, m_2, ..., m_k)
    : RETURNS machine
    new_m = m_1
    FOR i = 2 to k DO
        Traverse the states of new_m and replace
            any Success<> states with m_i
    RETURN new_m
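
A minimal Python sketch of MAKE_SEQUENCE follows, under stated assumptions: states are modeled as small Python objects, a Char_branch table is a dictionary in which a missing character means Failure<>, and the helper names (CharBranch, literal, matches_prefix) are hypothetical. The literal helper follows the String case of the construction process, building one single-character branch state per character and sequencing them.

```python
class Success:
    """Accepting terminal state (Success<> in the text)."""


class Failure:
    """Rejecting terminal state (Failure<> in the text)."""


class CharBranch:
    """Char_branch state: one successor per input character."""

    def __init__(self, table):
        self.table = table  # character -> next state; absent means Failure<>


def _replace_success(machine, replacement):
    """Redirect every edge into a Success<> state to `replacement`."""
    if isinstance(machine, Success):
        return replacement
    seen, todo = set(), [machine]
    while todo:
        state = todo.pop()
        if id(state) in seen or not isinstance(state, CharBranch):
            continue
        seen.add(id(state))
        for ch, nxt in list(state.table.items()):
            if isinstance(nxt, Success):
                state.table[ch] = replacement
            else:
                todo.append(nxt)
    return machine


def make_sequence(*machines):
    """Chain the machines so each one starts where the previous succeeds."""
    new_m = machines[0]
    for m_i in machines[1:]:
        new_m = _replace_success(new_m, m_i)
    return new_m


def literal(text):
    """Machine for a literal string, one Char_branch state per character."""
    if not text:
        return Success()
    return make_sequence(*[CharBranch({ch: Success()}) for ch in text])


def matches_prefix(machine, text):
    """Run a machine without backtracking; True if it reaches Success<>."""
    state, pos = machine, 0
    while isinstance(state, CharBranch):
        if pos >= len(text):
            return False
        state = state.table.get(text[pos], Failure())
        pos += 1
    return isinstance(state, Success)


abc = make_sequence(literal("ab"), literal("c"))
```

The sketch mutates the first machine in place, as the pseudocode suggests; an implementation that shares sub-machines between parse-tree nodes would need to copy them first.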






This concludes the preferred construction of FSAs
representing regular expressions from the reg-exp language of
this invention. Next, the FSA execution is described.

RUN-TIME INTERPRETATION OF REGULAR EXPRESSIONS

In a preferred embodiment, the FSA representing a
regular expression is executed by interpreting the structure
created in the previous code generation steps. The
interpreter preferably accesses several data structures
including the current state of the FSA, a pointer to the
current character position of the input string, a
backtracking stack and a mark stack. The backtracking
stack contains records characterizing the state of
interpretation of the FSA and is used in a manner known in
the art to implement backtracking, in case an attempted
partial match of the input string fails. Configuration
records include the current state and the current input
position. In an alternative embodiment, push states can be
avoided by implementing states with a list of possible next
states.

The preferred implementation of the novel variable
binding and action statement features of the reg-exp language
of this invention utilizes an additional stack, the mark
stack. The semantics of these features require that no
actions be executed until the entire regular expression
matching process succeeds. Upon success, all actions are
then executed with the variable bindings that occurred during
parsing. This means that a single variable, e.g., $1, can
have different bindings in the same action statement, which
can be executed multiple times on match success.

These semantics are implemented in the preferred
embodiment using the mark stack in the following manner.
When the current interpreted state is an Action<> state, the
action language statement is not immediately executed.
Rather, the action statement code is pushed onto the mark
stack. Similarly, when the current state is a mark state,
which represents a parenthesis in the regular expression
definition, the information present in the mark state is
pushed onto the mark stack. This information permits finding
the position in the input string of the beginning or ending
of a current variable binding. To prevent execution of
actions pushed during a partial match that is later abandoned,
the mark stack is part of the interpreter configuration, and
the configuration record on the backtracking stack further
includes the current position in the mark stack.

Action language statements are executed upon final match
success by scanning the mark stack from bottom to top. In
this scan, variable bindings are set as indicated by mark
state records and actions are executed as indicated by action
state records. In case of match failure, no action language
statements are executed.

In more detail, an embodiment of the preferred
implementation functions as follows:

GLOBAL VARIABLE current_state = initialized to the initial
    machine state
GLOBAL VARIABLE input_pos = initialized to the address of the
    beginning of the input string
GLOBAL VARIABLE BT_STACK = initialized to an empty stack
GLOBAL VARIABLE MARK_STACK = initialized to an empty stack

REPEAT UNTIL one of the following clauses exits
    CASE (current_state) IS
        Char_branch<state_0, ..., state_255>:
            current_state = state_(ASCII value of the
                current input character)
            input_pos = input_pos + 1
        Marker<num, open/close_flag, offset, nextstate>:
            push ("mark", num, open/close_flag,
                input_pos+offset) onto MARK_STACK
            current_state = nextstate
        Action<compiled_action_code, nextstate>:
            push ("action", compiled_action_code) onto
                MARK_STACK
            current_state = nextstate
        Push<state, nextstate>:
            push (state, input_pos, height(MARK_STACK))
                onto BT_STACK
            current_state = nextstate
        /* exit the REPEAT statement on final success */
        Success<>: GOTO done
        Failure<>:
            /* match finally fails if backtracking is
               attempted and the backtrack stack is empty */
            IF BT_STACK is empty THEN
                RETURN "fail"
            ELSE
                pop BT_STACK record into variables st,
                    ip, msh
                current_state = st
                input_pos = ip
                pop MARK_STACK down to height msh
    END CASE

/* when parsing succeeds, scan MARK_STACK and execute
   actions with variable bindings indicated by the marks */
done:
    LOCAL VARIABLE open_marks[], close_marks[]
    FOR element = bottom of MARK_STACK up to top of
    MARK_STACK DO:
        IF (element=("mark", num, "open", pos)) THEN
            SET open_marks[num] = pos
        IF (element=("mark", num, "close", pos)) THEN
            SET close_marks[num] = pos
        IF (element=("action", compiled_action_code)) THEN
            SET values of $0, $1, ..., $n to the positions
                in the input string indicated by
                open_marks[n] and close_marks[n]
            RUN compiled_action_statement using these
                bindings of $0, $1, ..., $n
    END FOR
    /* match succeeded and all actions executed */
    RETURN "succeed"

For example, in the preceding FOR loop, the binding of $2 is
set to the sub-string of the input string beginning at
open_marks[2] and ending at close_marks[2].
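
The interpreter loop and the final scan of the mark stack can be sketched in Python as below. This is a sketch under stated assumptions rather than the disclosed implementation: states are Python dataclasses, a Char_branch table is a dictionary in which a missing character means Failure<>, compiled action code is modeled as a Python callable that receives a dictionary of bound sub-strings, and all class and function names are hypothetical. The example machine corresponds roughly to the pattern (a|ab) followed by an action and then the character c.

```python
from dataclasses import dataclass

SUCCESS, FAILURE = "Success<>", "Failure<>"


@dataclass(eq=False)
class CharBranch:
    table: dict      # character -> next state; absent means Failure<>


@dataclass(eq=False)
class Marker:
    num: int
    kind: str        # "open" or "close"
    offset: int
    nxt: object


@dataclass(eq=False)
class Action:
    code: object     # callable standing in for compiled action code
    nxt: object


@dataclass(eq=False)
class Push:
    alt: object      # state resumed on backtracking
    nxt: object      # state entered immediately


def interpret(start, text):
    state, pos = start, 0
    bt_stack, mark_stack = [], []
    while state != SUCCESS:
        if isinstance(state, CharBranch):
            ch = text[pos] if pos < len(text) else None
            state = state.table.get(ch, FAILURE)
            pos += 1
        elif isinstance(state, Marker):
            mark_stack.append(("mark", state.num, state.kind, pos + state.offset))
            state = state.nxt
        elif isinstance(state, Action):
            mark_stack.append(("action", state.code))   # deferred, not run yet
            state = state.nxt
        elif isinstance(state, Push):
            bt_stack.append((state.alt, pos, len(mark_stack)))
            state = state.nxt
        else:                                           # Failure<>
            if not bt_stack:
                return "fail"
            state, pos, height = bt_stack.pop()
            del mark_stack[height:]                     # discard abandoned marks/actions
    open_marks, close_marks = {}, {}
    for entry in mark_stack:                            # bottom to top
        if entry[0] == "mark":
            _, num, kind, where = entry
            (open_marks if kind == "open" else close_marks)[num] = where
        else:
            bindings = {n: text[open_marks[n]:close_marks[n]]
                        for n in open_marks if n in close_marks}
            entry[1](bindings)
    return "succeed"


seen = []                                               # records each action run
tail = Marker(1, "close", 0, Action(seen.append, CharBranch({"c": SUCCESS})))
branch_a = CharBranch({"a": tail})
branch_ab = CharBranch({"a": CharBranch({"b": tail})})
start = Marker(1, "open", 0, Push(alt=branch_ab, nxt=branch_a))
result = interpret(start, "abc")
```

On the input "abc" the first alternative is tried and abandoned, so its close mark and action are popped from the mark stack and only the binding from the second alternative reaches the action.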

In a preferred embodiment, action language statements
are compiled into a variable length bytecode-type of
intermediate language that is interpreted at run-time by a
bytecode interpreter. The previously described action
language [...] Publishing Co., chapter 4. During parsing, all
variables [...] in the wrappers are assigned [...]. A
preferred such assignment assigns sequential integers to
named variables, e.g., $x, $abc, in [...] encountered, and
assigns serial negative integers to the numeric variables
denoting matched sub-strings, e.g., $0, $1, $2, etc. [...]
translation techniques. Adequate syntax directed
techniques are presented with examples in Aho et al. at
chapter 8. The preferred intermediate code is presented
below. In this presentation, intermediate language
instructions have the format of an instruction code, with
optional modifying information, followed by zero or more
arguments. Variables are denoted by <var> where var is the
2-byte integer coding of the variable. Relative branch
offsets are encoded as 4-byte integers. All branches are
relative to the current instruction address, i.e., the
address of the branch instruction itself, so the bytecode is
relocatable.

1. Function Calls:

Encoding: 0 funcode numargs <var_0> <var_1> ... <var_numargs>
Meaning: var_0 is assigned the result of the function
identified by "funcode" on the arguments var_1, ...,
var_numargs, of number "numargs." Each built-in
function, e.g., argc, argv, etc., and each operator,
e.g., "+", "-", etc., of the language has its own
unique integer "funcode".

2. Branch Instructions:

Encoding: 1 <offset> <var0>
Meaning: if var0 is true, branch by amount offset
relative to the branch instruction
Encoding: 2 <offset> <var0>
Meaning: if var0 is false, branch by amount offset
relative to the branch instruction
Encoding: 3 <offset>
Meaning: always branch by amount offset relative to the
branch instruction

3. Load Constant Instructions:

Encoding: 4 <var0> <null-terminated-string>
Meaning: var0 is assigned the null-terminated-string
constant
Encoding: 5 <var0> <floating-point-constant>
Meaning: var0 is assigned the floating-point-constant

4. Move Instructions:

Encoding: 6 <var0> <var1>
Meaning: var0 is copied from var1

5. Output Instructions:

[...] terminates the current tuple, causes it to be
returned to the netbot

[...] machine for <nt> [...]

7. Exit Instructions:

Encoding: 10
Meaning: exits the current action [...]

In alternative embodiments, different intermediate codes
or even no intermediate code can be used instead of the
previously disclosed intermediate code. For example, a
reverse Polish stack-based intermediate code could be used.
Optionally, no intermediate code is used and action language
statements are compiled directly to a machine language. Both
these alternatives can be implemented by using the disclosed
syntax-directed translation techniques. See, e.g., Aho et al.

Interpretation of actions is performed in a preferred
embodiment [...]. CURRENT_INSTRUCTION is initialized to
the first instruction of the bytecode to be executed. The
interpreter [...] retrieves the bytecode pointed to by
CURRENT_INSTRUCTION, performs the coded simple action
according to the bytecode language previously described, and
updates CURRENT_INSTRUCTION to point either to the next
instruction or, for taken branch instructions, to the
instruction address as modified by the offset value. The
interpreter finishes execution upon encountering an exit
instruction.
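
The fetch, execute and update cycle of the bytecode interpreter can be sketched in Python as follows. Only the parts stated in the text are taken from it: the instruction codes 0 to 6 and 10, 2-byte variable codes, 4-byte branch offsets, and branches relative to the address of the branch instruction itself. Everything else is an assumption of this sketch: one byte each for the instruction code, funcode and numargs, little-endian signed integers, 8-byte floating-point constants, the two funcodes used in the example, and the helper names.

```python
import struct


def run(code, variables, functions):
    """Interpret bytecode until an exit instruction (code 10) is reached."""
    pc = 0                                    # plays the role of CURRENT_INSTRUCTION
    while True:
        op = code[pc]
        if op == 0:                           # function call
            funcode, numargs = code[pc + 1], code[pc + 2]
            ids = struct.unpack_from(f"<{numargs + 1}h", code, pc + 3)
            args = [variables[v] for v in ids[1:]]
            variables[ids[0]] = functions[funcode](*args)
            pc += 3 + 2 * (numargs + 1)
        elif op in (1, 2):                    # branch if true / branch if false
            offset, var = struct.unpack_from("<ih", code, pc + 1)
            taken = bool(variables[var]) == (op == 1)
            pc += offset if taken else 7      # offset is relative to the branch itself
        elif op == 3:                         # unconditional branch
            (offset,) = struct.unpack_from("<i", code, pc + 1)
            pc += offset
        elif op == 4:                         # load null-terminated string constant
            (var,) = struct.unpack_from("<h", code, pc + 1)
            end = code.index(0, pc + 3)
            variables[var] = code[pc + 3:end].decode("ascii")
            pc = end + 1
        elif op == 5:                         # load floating-point constant
            var, value = struct.unpack_from("<hd", code, pc + 1)
            variables[var] = value
            pc += 11
        elif op == 6:                         # move
            dst, src = struct.unpack_from("<hh", code, pc + 1)
            variables[dst] = variables[src]
            pc += 5
        elif op == 10:                        # exit
            return variables
        else:
            raise ValueError(f"unsupported instruction code {op}")


# Tiny assembler helpers (hypothetical, for the example only).
def call(funcode, dst, *args):
    return bytes([0, funcode, len(args)]) + struct.pack(f"<{len(args) + 1}h", dst, *args)


def load_float(var, value):
    return bytes([5]) + struct.pack("<hd", var, value)


def branch_if_true(offset, var):
    return bytes([1]) + struct.pack("<ih", offset, var)


# Sum 3 + 2 + 1 with a backward branch: v0 = total, v1 = counter, v2 = step.
body = call(0, 0, 0, 1) + call(1, 1, 1, 2)    # v0 = v0 + v1; v1 = v1 - v2
program = (load_float(0, 0.0) + load_float(1, 3.0) + load_float(2, 1.0)
           + body + branch_if_true(-len(body), 1) + bytes([10]))
functions = {0: lambda a, b: a + b, 1: lambda a, b: a - b}   # assumed funcodes
result = run(program, {}, functions)
```

Because the branch offset is computed from the length of the loop body alone, the same byte sequence runs unchanged wherever it is placed, which is the relocatability property noted in the text.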

The parse built-in function is executed at run-time by
calling the FSA interpreter to interpret the compiled regular
expression code specifying the finite state machine.
Output(a_1, ..., a_n) statements are translated and then
executed by executing bytecode instruction code 7 for each
variable in the output list followed by a bytecode
instruction code 8 that terminates the current tuple, i.e.,
marks the end of the output statement.
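
The translation of an Output statement can be illustrated with a short Python sketch. It assumes, for illustration only, that instruction code 7 carries a single variable code as its operand and appends that variable's value to the tuple under construction; the function names and the decoded tuple representation of instructions are hypothetical.

```python
def translate_output(variable_ids):
    """Emit one code-7 instruction per variable, then a code-8 terminator."""
    instructions = [(7, var) for var in variable_ids]
    instructions.append((8,))
    return instructions


def run_output(instructions, variables):
    """Collect output tuples as the two instruction codes are described."""
    tuples, current = [], []
    for instr in instructions:
        if instr[0] == 7:                 # add one variable to the current tuple
            current.append(variables[instr[1]])
        elif instr[0] == 8:               # terminate the current tuple
            tuples.append(tuple(current))
            current = []
    return tuples


emitted = translate_output([1, 2, -1])
rows = run_output(emitted, {1: "a", 2: "b", -1: "c"})
```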

6. SPECIFIC EMBODIMENTS, CITATION OF REFERENCES

The present invention is not to be limited in scope by
the specific embodiments described herein. Indeed, various
modifications of the invention in addition to those described
herein will become apparent to those skilled in the art from
the foregoing description and accompanying figures. Such
modifications are intended to fall within the scope of the
appended claims.

Various publications are cited herein, the disclosures 
of which are incorporated by reference in their entireties. 



WHAT IS CLAIMED IS:

1. A method for assisting a user to query for information
available from information sources attached to a
network, said method comprising the steps of:
a. selecting the one or more information sources most
relevant to a user query;
b. formatting said user query for each said relevant
information source according to a description of each
said relevant information source written in a wrapper
description language;
c. transmitting said formatted query to each of said
relevant information sources;
d. extracting data fields relevant to said user query
from responses returned from said relevant information
sources, said extracting according to said description
of said relevant information source returning each said
response; and
e. presenting said relevant data fields to said user.

2. The method of claim 1 further comprising, after said
[...] said responses to said user query.

3. The method of claim 1 wherein said selecting step further
comprises the steps of:
a. determining one or more concepts to which said user
query relates; and
b. selecting an information source as relevant if said
information source relates to said concepts.



4. The method of claim 1 wherein said wrapper description
language comprises facilities for specification of [...]
such that upon recognition of [...] bound
as of the time of recognition of said component regular
expression, said actions actually being executed only if
said entire regular expression is also recognized.

5. A system for assisting a user to query for information
available from information sources attached to a
network, said system comprising:
a. a first processor means for selecting the one or
more information sources most relevant to a user query;
b. a second processor means for formatting said user
query for each said relevant information source
according to a description of each said relevant
information source written in a wrapper description
language, and for transmitting said formatted query to
each of said relevant information sources;
c. a third processor means for extracting data fields
relevant to said user query from responses returned from
said relevant information sources, said extracting
according to said description of said relevant
information source returning each said response; and
d. a fourth processor means for presenting said
relevant data fields to said user.

6. The system of claim 5 wherein said first, said second,
said third, and said fourth processor means reside in one
network attached computer, said computer comprising a
CPU and memory.

7. The system of claim 5 wherein said first, said second,
said third, and said fourth processor means reside in
separate network attached computers, each said computer
comprising a CPU and memory.

8. The system of claim 5 wherein said fourth processor means
presents said relevant data fields by executing a World
Wide Web browser.